DataX – Brian Arnold: using data science to answer questions in biology

Written by
Sharon Adarlo
March 22, 2021

In his past research, Brian Arnold traveled Europe to study a common wildflower, Arabidopsis arenosa, which has white to lavender-colored flowers that resemble violets at a glance and grows on rocky outcrops in the Alps and the Carpathian Mountains.

Arnold studied the evolutionary dynamics of this plant, which spawned a race with a duplicated genome that successfully spread across Europe as glaciers melted 20,000 years ago. Using advanced computational methods to study the plant's DNA sequence, Arnold found patterns that identify where this genome duplication event occurred and the routes the plant took as it spread across the European continent.

As a data scientist within Princeton University's Schmidt DataX Initiative, Arnold has now switched his focus from plants to human cancer cells and other biological processes. The DataX Initiative aims to spread and deepen artificial intelligence and machine learning across campus to speed scientific discovery. The Center for Statistics and Machine Learning (CSML) participates in this initiative and is involved in hiring and creating a community of data scientists like Arnold at Princeton.

His current project is quite a change from his previous area of study. But Arnold explained that techniques similar to those used to study plants can elucidate how cancer cells evolve and grow, and can help decipher the mechanisms underlying other human diseases and conditions.

"While many plants duplicate their genomes and subsequently thrive, this phenomenon creates cellular instability in mammals and is exceedingly rare, except in human cancer cells. These thrive on instability," said Arnold.

"Moreover, specific evolutionary processes create predictable signatures in genomes, regardless of the species in which the genome originated. Evolutionary biology is highly transferable across organisms."

Princeton announced the creation of DataX in February 2019. Its mandate includes hiring data scientists to participate in three research areas: the Princeton Catalysis Initiative, the Center for Information Technology Policy, and biomedical data science. Arnold is part of the last of these.

The Department of Computer Science, the Lewis-Sigler Institute for Integrative Genomics (LSI), Princeton Neuroscience Institute, and several engineering departments are all part of the biomedical science initiative. Arnold's contribution to this research endeavor involves managing and interpreting genetic data, which has grown spectacularly in size due to rapid genome-sequencing technologies over the last few years.

"For DataX, I hope to bring my intuition of evolution and my bioinformatic skills, which involve processing and analyzing biological data via standardized techniques," said Arnold. "Many times, these bioinformatic procedures can take enormous amounts of time, especially for those with less experience. Making this process as painless and reproducible as possible for researchers is a major goal of my position."

"We are pleased that Brian has joined Princeton and DataX," said Peter Ramadge, director of CSML. "His unique, multi-disciplinarian background is well suited to the task we have set out in DataX: enhancing the research capabilities of our faculty and younger scholars and accelerating the use of data science on campus."

Arnold is working with faculty members Ben Raphael, professor of computer science; Olga Troyanskaya, professor of computer science and LSI; and Barbara E. Engelhardt, associate professor of computer science. All three work at the intersection of data science and biology.

"I will initially work with the petabytes of cancer genomes hosted on Google Cloud," Arnold said. "But I will also use modern cloud-computing techniques with a variety of other datasets according to the interests of the labs I will collaborate with."

Before coming to Princeton, Arnold earned his bachelor's degree in plant biology from the University of Minnesota, Twin Cities, in 2008. He then went on to Harvard University, where he earned his doctoral degree in organismic and evolutionary biology in 2015. His dissertation was titled "Evolutionary dynamics of a multiple-ploidy system in Arabidopsis arenosa."

After earning his Ph.D., Arnold was a visiting scientist at the University of Helsinki and a Ruth L. Kirschstein Postdoctoral Fellow at the Harvard T.H. Chan School of Public Health. In 2018, he was a consultant for the company Day Zero Diagnostics, where he worked on bacterial genomics and transmission dynamics at Massachusetts General Hospital. From 2018 to earlier this year, Arnold was a senior bioinformatics scientist at Harvard University. He taught workshops, collaborated with faculty, and helped students analyze large sequencing datasets using high-performance computing.

"I am excited to be at a new place with lots of brilliant, motivated people because working with enthusiastic scientists is the best part of my job," said Arnold. 

One of his first projects at Princeton's DataX entails building an analytic workflow that enables researchers to analyze thousands of cancer genomes on the cloud. Researchers have turned to cloud computing because so much cancer data is already in cloud storage. Downloading all this data to analyze it elsewhere would be time-consuming and expensive.

"I am very excited to work on biomedical data because this field uses all the latest in technologies and techniques," said Arnold. "I'm excited to work with the massive amounts of biomedical data that exist and, with my collaborators, to try to come up with new and creative ways of making sense of it all."