Machine learning course reveals its utility in different disciplines

Feb. 28, 2023
Diagram showing unsupervised machine learning

An excerpt from a slideshow presented by Peter Ramadge, CSML director.

Before 2000, many social scientists avoided quantitative studies of text: the work was time-consuming, content was difficult to search for relevant information, and methods did not generalize well because each new data set presented unique challenges.

“There was a lot of social interaction occurring in text at that point, but many social scientists didn’t analyze text and speech quantitatively because there wasn’t a sense on how to do it,” said Brandon Stewart, associate professor of sociology at Princeton University.

But machine learning has reduced these challenges and opened new avenues of inquiry in social science research. For example, Stewart has used machine learning to explore and tease meaning out of text in Chinese newspapers.

Stewart presented his research at a special Wintersession workshop on January 25. The workshop, titled “What is Machine Learning, and Can It Aid My Research,” gave attendees a broad overview of machine learning, with guest speakers such as Stewart providing specific examples of how they have used algorithm-driven tools in their fields. The workshop, hosted by the Center for Statistics and Machine Learning (CSML), was geared towards faculty, postdocs and graduate students.

“Machine learning has become increasingly important in research because these tools are well suited in dealing with very large data sets or information that has a high degree of complexity,” said Peter Ramadge, director of CSML and the first speaker at the three-hour workshop. “This workshop is meant to raise awareness about these tools, in addition to delving into topics such as data set curation and software tools.”

Diagram of a convolutional network

Excerpt from a slideshow presented by Peter Ramadge, CSML director.

This is the second year that CSML has offered this Wintersession workshop. The first one was held in 2022, also with Ramadge leading the proceedings.

Ramadge opened this year’s workshop with a high-level, broad view of machine learning. He first introduced the mission of CSML, which is to provide a focal point for statistics and machine learning research and teaching on campus. He then touched on various topics: data sets like ImageNet and MNIST, training and testing models, unsupervised and supervised learning, deep learning, neural networks, and software tools such as Python, scikit-learn, TensorFlow and Keras.

Ramadge also touched on exciting recent developments in machine learning, such as a model that creates unconventional chip layouts with better performance metrics. He also briefly discussed controversial flashpoints in machine learning, especially the large language model ChatGPT.

In his presentation, Stewart talked at length about the explosion of unstructured data driven by several factors, including email, Google’s digitization of books, and online communities leaving digital footprints. All of these present ripe opportunities to use text as data to learn about latent societal trends.

Stewart talked about how he used machine learning to study the media landscape in China and the impact of propaganda. He learned that scripted propaganda from the government’s central office was increasingly making it to the front page of newspapers.

Brian Arnold, a DataX data scientist who works on biomedical data science projects, spoke about his research, which involves cancer cells and the task of unraveling their complex evolution. (DataX is short for Schmidt DataX Fund, which aims to spread and deepen the use of machine learning on the Princeton campus. CSML oversees part of this initiative and has been heavily involved in hiring and mentoring data scientists such as Arnold.)

Machine learning has been helpful in analyzing cancer cells because it can process reams of genetic material quickly, Arnold said. A notable project Arnold worked on is HATCHet, or Holistic Allele-specific Tumor Copy-number Heterogeneity, an algorithm that finds and analyzes genes duplicated or deleted in multiple tumor samples from a single cancer patient. Arnold worked on this program with Ben Raphael, professor of computer science, and Vineet Bansal, senior research software engineer jointly appointed to CSML and the Princeton Institute for Computational Science and Engineering.

Peter Melchior, assistant professor jointly appointed to astrophysical sciences and CSML, ended the workshop by talking about his own research using machine learning algorithms to process the large amounts of data coming from cosmological surveys, such as LSST, Euclid and WFIRST. Melchior’s research group develops “techniques for source separation, mixture modeling, and data fusion, using proximal techniques and, increasingly, neural networks.”

Reflecting on the workshop, Ramadge said he hoped attendees got a good look at the basics of machine learning and how and when to use it.

“Research is being advanced due to the deployment of these tools,” said Ramadge. “Hidden insights that would have been very difficult to discover by other means are coming to the forefront, aiding research from social science to astrophysical science.”