Michael Skinnider: using generative AI to accelerate discovery of small molecules

Written by
Allison Gasparini
Nov. 19, 2024

When the human body breaks down food, drugs or even its own tissue, it produces small molecules called metabolites. Using analytical techniques, researchers can typically detect thousands of small molecules in a sample of human tissue. While some of these molecules are known, many of the small molecules in a given sample cannot be identified by researchers for one reason or another. Identifying these unknowns is a question of scientific interest.

“These could be new biomarkers of disease,” said Michael Skinnider, an assistant professor at the Lewis-Sigler Institute for Integrative Genomics at Princeton University. “They could be new therapeutic targets.”

Despite the importance of metabolic small molecules, few scientific tools can identify their unknown chemical structures. Small molecules are especially diverse and behave in unpredictable ways, said Skinnider. This makes it difficult to analyze and reanalyze new and existing tissue data containing unknown small molecules. So how can researchers identify the thousands of unknown metabolites that exist in human bodies?

“This is a basic science question that we have very limited ability to solve with existing computational tools,” said Skinnider. “So, we’re developing new computational approaches, many of them based on machine learning, to take this complex metabolomic data and translate it into chemical structures.”

Navigating an enormous space

At a talk given at the Center for Statistics and Machine Learning on Nov. 12 as part of the center’s ongoing Lunchtime Faculty Seminar series, Skinnider discussed how he’s using language models to accelerate the discovery of new small molecules. 

The most current database of human metabolites contains over 2,000 known small molecules. When accounting for synthetic molecules we come in contact with through diet and household products, researchers estimate that the number of still-unknown small molecules present in human tissues can reach into the millions. The space is so enormous that searching for every valid chemical structure that may exist becomes incredibly difficult. That’s where machine learning can help.

[Figure: how machine learning targets new chemical structures for discovery. Figure courtesy of Michael Skinnider.]

Skinnider and his colleagues use a large language model trained on the structures of known small molecules in the human body to look for related, but unknown, metabolites. The underlying technology is similar to Skinnider’s previous research, which used generative AI to help anticipate the chemical structures of designer drugs. “The key idea has been to predict the existence of metabolites that are likely to be discovered in the future and then search in a more targeted way for these small molecules,” said Skinnider. 

The research relies on a number of experimental collaborators, including those in laboratories on campus such as the lab of Josh Rabinowitz, professor of chemistry and the Lewis-Sigler Institute for Integrative Genomics. Skinnider’s PhD student Hantao (Tony) Qiang spends time in the Rabinowitz wet lab processing tissue samples to collect data for identifying new molecules. Since beginning to work with the Rabinowitz lab, Skinnider and his collaborators have experimentally discovered nearly 50 new small molecules in humans and mice.

“I was surprised at how well this model works and how much it accelerated the pace at which we have been able to discover new metabolites,” said Skinnider. “It’s been really exciting to use these machine learning tools to make discoveries experimentally in the lab.”