DataX progress on chemistry, neuroscience and federated learning research

Written by
Sharon Adarlo
June 15, 2020

Under Princeton University’s DataX Fund, several research teams have been busy exploring how data science tools can speed up scientific discovery in a wide range of disciplines.

From these teams, we profile three projects that have yielded promising results in, respectively, creating new processes for chemical compounds, deciphering brain activity, and enhancing the security of federated machine learning, a technique that relies on decentralized data. The three were among nine projects that received DataX funding in November.

“These projects show how useful data science tools can be,” said Peter Ramadge, the director of the Center for Statistics and Machine Learning (CSML). “Data science is enabling our researchers to conduct research that was once difficult or impossible to do in the past. It’s been exciting to see how data science is helping power research into new areas of inquiry.”

CSML oversees the DataX Fund, which launched in February 2019. The fund aims to broaden the use of data science in discovery across campus through research and educational initiatives. More on the fund can be read here.


Updates on three projects funded by DataX follow:


“Physical Priors for Generative Modeling of Molecular Structures and Interactions”

Ryan Adams, Professor of Computer Science and Director of the Undergraduate Program in Statistics and Machine Learning, and Abigail Doyle, the A. Barton Hepburn Professor of Chemistry

Synthesizing new chemical compounds for the pharmaceutical industry and other industrial sectors can be a difficult, time-consuming and expensive process. Machine learning tools can play a role in streamlining this process, said Adams.

“Synthetic chemists often view making new compounds as akin to playing chess,” said Adams. “These days, AI is pretty good at playing chess, so we think machine learning may really be able to impact synthetic chemistry.”

To tackle this problem, Adams and Doyle divided their proposal into two ongoing projects: “Reaction Condition Optimization through Knowledge Transfer” and “Discrete Object Generation with Reversible Inductive Construction.”

In the first, the two researchers use machine learning techniques to optimize chemical reactions or develop easier pathways to generate desired compounds at high yield. Specifically, the team found optimal reaction conditions for the Buchwald-Hartwig reaction, a chemical reaction used to create carbon-nitrogen bonds. Chemists use this reaction to develop new medications. The team plans to apply what it learned to chemical reactions more generally.

In the second part of their proposal, the team used machine learning techniques to develop a process to model molecular structures and their chemical interactions in order to create new, useful compounds. This process utilizes a generative modeling framework. They applied their process to learn the ZINC dataset, a collection of 250,000 drug-like molecules. Their experiments showed that their process was able to yield some promising novel compounds that were similar to ones found in ZINC.


“Decoding the Language of the Brain”

Uri Hasson, Professor of Psychology and the Princeton Neuroscience Institute; Karthik Narasimhan, Assistant Professor of Computer Science; Kenneth Norman, the Huo Professor in Computational and Theoretical Neuroscience and Professor of Psychology and Neuroscience

The neural pathways and code that enable human communication are poorly understood. This project aims to decipher that mystery by diving into the inner workings of the human brain and studying how our thoughts become words, how we communicate with each other, and how we use language. Simply put, this study aims to translate brain activity to language.

“We want to be able to look at your brain activity and anticipate what you will say,” said Hasson.

The project involved recording the neural activity of volunteers, through electrodes implanted in their brains, for a week as they conversed with other people.

“The volunteers are epileptic patients with implanted intracranial electrodes, which are used by clinicians to detect the epileptic source. In other words, the data are unique because they rely on invasive recording methods not available when we study typical students in the lab,” said Hasson.

From this database of high-quality brain activity, collected during the production and comprehension of natural speech, the researchers used machine learning techniques to see what patterns could be uncovered and how they link to modes of communication.

The researchers said the project was divided into three parts. First, they developed a variant of the Transformer model, a novel neural network architecture that Google released in 2017, to decode brain signals into English words. The team is now working to improve the accuracy of this model.

Secondly, the team built on that work to develop another Transformer model that decodes longer sequences of words from brain signal inputs by capturing correlations between the words produced by a speaker as well as brain signals over time. The researchers said this model would allow them to investigate interesting questions such as, “How far ahead does a person think about what words he or she speaks out?” The researchers are now engaged in improving the model’s performance.

Thirdly, the researchers used machine learning models to predict brain signals from words.

“Such results establish links between modern context-based machine learning language models and context dependent neural response to real-life sentences in real-life contexts,” said Hasson.


“Secure and Private Federated Learning”

Prateek Mittal, Associate Professor of Electrical Engineering; H. Vincent Poor, the Michael Henry Strater University Professor of Electrical Engineering

Machine learning algorithms operating on mobile networks can be categorized into three different types. First is the classical, standard approach in which end-user devices send their data to a central server, where this data is used to train a model. Second is a distributed setting in which each device trains its own model and sends its model parameters to a central server, where these parameters are aggregated to create one final model. The third, in contrast, is federated machine learning, an emerging technique that allows end-user devices to train and improve models collaboratively from decentralized data by interacting iteratively with a central server or network edge device.

In mobile networks, a federated machine learning-powered algorithm would be running on large numbers of smartphones with the goal of learning from user interactions. The algorithm locally learns from users’ training data on the smartphones and sends model updates to a cloud server or edge device, which performs global model aggregation, while the end user training data never leaves the phone. This process is then repeated iteratively until it converges to a stable model.
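The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the researchers' system: it assumes the server combines client models with a size-weighted average (commonly called federated averaging, which the article does not name), and the toy linear model, the synthetic data, and the function names `local_update`, `federated_averaging` and `make_clients` are all hypothetical.

```python
import random


def local_update(w, b, data, lr=0.05, epochs=5):
    """Client step: a few epochs of gradient descent on private local data."""
    n = len(data)
    for _ in range(epochs):
        # Gradients of mean squared error for the model y = w * x + b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def federated_averaging(clients, rounds=200):
    """Server loop: broadcast the model, collect updates, average them."""
    w, b = 0.0, 0.0
    total = sum(len(data) for data in clients)
    for _ in range(rounds):
        # Only model parameters travel to the server; raw data stays local.
        updates = [(local_update(w, b, data), len(data)) for data in clients]
        # Weight each client's model by the size of its local dataset.
        w = sum(params[0] * n for params, n in updates) / total
        b = sum(params[1] * n for params, n in updates) / total
    return w, b


def make_clients(num_clients=5, seed=0):
    """Synthetic stand-in for phones: each holds noisy samples of y = 3x + 1."""
    rng = random.Random(seed)
    clients = []
    for _ in range(num_clients):
        size = rng.randint(10, 30)
        data = []
        for _ in range(size):
            x = rng.uniform(-1.0, 1.0)
            data.append((x, 3.0 * x + 1.0 + rng.gauss(0.0, 0.01)))
        clients.append(data)
    return clients


if __name__ == "__main__":
    w, b = federated_averaging(make_clients())
    print(f"learned w={w:.3f}, b={b:.3f}")
```

In this sketch the server only ever sees the pairs `(w, b)` returned by each client. A real deployment would add client sampling, secure communication and defenses against malicious updates, which are the kinds of issues this project studies.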

Security and privacy vulnerabilities are not well understood in this approach. With this project, the researchers sought to use data science techniques to examine security, privacy, and utility issues in federated learning and to develop more robust designs.

“We want to know to what extent can malicious users harm these programs, and also how we can make federated machine learning better in blunting those attacks,” said Mittal.

The team has made progress in this project by developing an information-theoretic framework for all three of the aforementioned machine learning paradigms, including federated learning. From this framework, the researchers were able to derive a list of fundamental properties for each type, such as bounds on privacy leakage and upper and lower bounds on the generalization error of learning, which measures how well a trained model performs on data it has not seen.