Princeton researchers tackle reproducibility in machine learning

Dec. 21, 2022

In recent years, scientists have noticed that the conclusions of some published research that relies heavily on machine learning cannot be reproduced.

To uncover why this is happening, Sayash Kapoor, a computer science doctoral student affiliated with the Center for Information Technology Policy (CITP), and Arvind Narayanan, professor of computer science, a participating faculty member of the Center for Statistics and Machine Learning (CSML) and an associated faculty member of CITP, published the paper “Leakage and the Reproducibility Crisis in ML-based Science.”

The paper compiled a list of 20 reports from 17 fields that collate or highlight “reproducibility failures or pitfalls” in machine learning-based science, together affecting 329 papers.

Kapoor and Narayanan wrote that data leakage appears to be the biggest cause of these failures to reproduce. Data leakage occurs when information from outside the training data set, such as information from the test set or information that would not be available when the model is actually used, influences how a machine learning model is built, leading to overly optimistic results. While machine learning researchers have examined the reproducibility issue overall, there has been little investigation of data leakage in machine learning-based scientific research, which is the focus of Kapoor and Narayanan’s paper.
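As an illustration of the kind of error involved, the following sketch (not drawn from the paper, and using made-up data and a generic scikit-learn model chosen only for brevity) contrasts a leaky workflow, in which a preprocessing step is fit on the full data set before splitting, with a safe one that fits preprocessing only on the training partition:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data; any tabular data set would do.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Leaky version: the scaler is fit on ALL rows, so statistics computed from
# the eventual test rows influence the features the model is trained on.
X_scaled = StandardScaler().fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(X_scaled, y, random_state=0)
leaky_model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("leaky test accuracy:", leaky_model.score(X_te, y_te))

# Safe version: split first, then fit every preprocessing step inside a
# pipeline so it only ever sees the training rows.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
safe_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
safe_model.fit(X_tr, y_tr)
print("safe test accuracy:", safe_model.score(X_te, y_te))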

To that end, the researchers presented “a fine-grained taxonomy of eight types of leakage that range from textbook errors to open research problems.” They also created a framework called “model sheets” for detecting leakage in machine learning-based science papers before they go to print. The model sheets, organized around the eight types of data leakage, would accompany a paper as supplementary material and provide transparent, detailed information about the model.

To demonstrate the model sheets’ efficacy, the researchers examined studies that purport to predict civil wars and used model sheets to point out instances of data leakage.

Workshop on Reproducibility

To further highlight reproducibility and suggest changes in scientific research, Princeton scholars held a one-day workshop, “The Reproducibility Crisis in ML-based Science,” on July 28, which drew ten speakers from Princeton, other universities, the nonprofit world and industry. The event was hosted by CSML and DataX, an initiative in which CSML participates that aims to spread and deepen artificial intelligence and machine learning across campus to speed scientific discovery.

Recordings of each session of the event are available online. According to an article from CITP, “more than 1,700 people from 500 institutions and 30 countries had registered” for the workshop, and nearly 600 people watched live via Zoom and YouTube combined, a testament to the wide interest in the topic.

“The key thing is that a dozen fields are independently discovering these issues,” said Narayanan, referring to the online list of papers collating machine learning reproducibility failures. “It’s clear that some systematic intervention is needed.”

Narayanan, along with a cohort of graduate and undergraduate students from Princeton, Cornell, and Northwestern universities, organized the workshop. 

While opening the July event, Narayanan proposed a hypothesis about the cause of the reproducibility problem: the pressure to publish, coupled with machine learning’s “sharp edges,” has led to scientific practices that are not rigorous enough to handle machine learning. In addition, because researchers in most cases have access to the test labels, errors such as leakage can lead to overestimates of model performance.

The workshop was divided into three sessions: Diagnose, Fix and Future Paths, each followed by a panel discussion led by a moderator.

The first session featured Michael Roberts, senior research associate at the University of Cambridge; Gilles Vandewiele, a postdoctoral researcher at Ghent University; and Odd Erik Gundersen, associate professor at the Norwegian University of Science and Technology.

During his talk, Vandewiele discussed how recent studies have “reported near-perfect results” in predicting preterm or term births using electrohysterography, a technique that reads electrical activity in the uterus. However, when he and a team of researchers examined these studies, they concluded that the results were “overly optimistic” because of flaws in the data analysis process. The team focused on one methodological flaw in particular: applying over-sampling before partitioning the data into mutually exclusive training and testing sets, which leads to biased results. This work was detailed in a 2020 paper that Vandewiele co-authored, “Overly Optimistic Prediction Results on Imbalanced Data: a Case Study of Flaws and Benefits when Applying Over-sampling.”
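The following sketch illustrates that ordering problem with synthetic data rather than the electrohysterography recordings from the study; it assumes the SMOTE over-sampler from the imbalanced-learn library and a generic classifier, chosen only for illustration. Over-sampling before the split lets synthetic minority samples derived from test rows leak into training, inflating the reported score:

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the real recordings.
X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.95, 0.05], random_state=0)

# Flawed order: over-sample the whole data set, then split. Synthetic
# minority samples built from test rows leak into the training set.
X_os, y_os = SMOTE(random_state=0).fit_resample(X, y)
X_tr, X_te, y_tr, y_te = train_test_split(X_os, y_os, stratify=y_os, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
print("optimistic AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))

# Correct order: split first, then over-sample only the training partition.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
X_tr_os, y_tr_os = SMOTE(random_state=0).fit_resample(X_tr, y_tr)
clf = RandomForestClassifier(random_state=0).fit(X_tr_os, y_tr_os)
print("honest AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))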

The second session featured Michael Lones, associate professor at Heriot-Watt University; Marta Serra-Garcia, associate professor at the University of California, San Diego; Momin Malik, senior data science analyst at the Mayo Clinic; and Inioluwa Deborah Raji, a fellow at the Mozilla Foundation.

Lones presented a talk, “How to Avoid Machine Learning Pitfalls: A Guide for Academic Researchers,” which covered the same ground as his 2021 paper of the same title.

“A lot of mistakes we see in machine learning are caused by the inexperienced,” he said. “Part of the issue is that novice machine learning practitioners often struggle to understand the more established machine learning literature.” 

Lones covered practices researchers should adopt, such as taking the time to understand a data set, talking to domain experts, considering combinations of models, evaluating a model multiple times, and reporting performance in multiple ways.
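Two of those practices, evaluating a model multiple times and reporting performance in multiple ways, can be illustrated with the short sketch below (our example, not Lones’ code), which runs repeated cross-validation and reports several metrics along with their spread:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate

# Synthetic, mildly imbalanced data used purely for illustration.
X, y = make_classification(n_samples=600, n_features=20,
                           weights=[0.8, 0.2], random_state=0)

# Evaluate many times: repeated stratified cross-validation gives 50
# train/test splits instead of a single, possibly lucky, one.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)
scores = cross_validate(LogisticRegression(max_iter=1000), X, y, cv=cv,
                        scoring=["accuracy", "balanced_accuracy", "roc_auc", "f1"])

# Report performance in multiple ways, with spread rather than one number.
for metric in ["accuracy", "balanced_accuracy", "roc_auc", "f1"]:
    vals = scores["test_" + metric]
    print(f"{metric}: {vals.mean():.3f} +/- {vals.std():.3f}")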

The third session featured Jake Hofman, senior principal researcher at Microsoft; Jessica Hullman, the Ginni Rometty Associate Professor of Computer Science at Northwestern University; and Brandon Stewart, associate professor of sociology at Princeton University. 

In his talk, Hofman discussed points he and his co-authors made in a 2021 Nature paper on computational social science and how they also apply to machine learning-based scientific research. The paper, “Integrating explanation and prediction in computational social science,” addresses different research approaches in social science and computer science: social scientists typically seek to explain human behavior, “often invoking causal mechanisms derived from substantive theory,” an “explain culture” that traditionally emphasizes causal effects. Computer scientists, by contrast, are interested in building accurate predictive models, “a predictive culture” that traditionally emphasizes “predictive performance,” Hofman said.

He showed examples of the pitfalls of each culture. On the dangers of the explain culture, Hofman pointed to an influential 2003 study that argued that economic factors, rather than ethnic and religious strife, lead to civil wars; after further examining the data, a later study in 2010 concluded that the findings were not accurate. On the negatives of the predictive culture, Hofman brought up Google Flu Trends, which famously failed to predict the trajectory of the flu in 2013.

Hofman and his fellow researchers addressed how these two approaches, or cultures, can complement each other and be woven into an “integrative model.” In his talk, Hofman said machine learning-based research can draw lessons from that paper by integrating both cultures and using them to critique or check conclusions. This approach, he said, should lead to more replicable science.

“Reproducibility is this cornerstone,” said Hofman. “If we don’t have individual results that are credible, then it’s hard to go anywhere. We want individual results to hold up and become cumulative and build on each other.”