Parts of Africa are plagued by what policy experts call a “resource curse”: the continent is rich in natural resources, from petroleum to valuable minerals, and yet many of its resource-rich regions suffer from conflict, poverty and poor development.
Saran Toure ’22, a politics major at Princeton University, set out to use data science techniques to examine the resource curse and test whether it is connected to colonialism. She focused on West African countries that France colonized and where French ties persist in culture, investment and, at times, military intervention.
Toure applied techniques such as time series analysis and multilevel linear regression models to various datasets. She detected a connection between heavy extractive resource policies during the colonial period and lower literacy and primary school attendance rates in the post-independence era.
“The natural resource curse in the region is usually attributed to weak contemporary institutions, and the inefficient distribution of resource rents. However, this understanding of the resource curse is very ahistorical and does not account for the impacts of French colonial influence,” said Toure, who is originally from Guinea, one of the countries that’s part of her study.
She presented her findings at the Center for Statistics and Machine Learning’s (CSML) annual undergraduate poster session, held earlier this month. The virtual event drew 124 students from 13 departments and centers, including African American Studies, chemical and biological engineering, and ecology, to name a few. CSML feted the poster session participants with a celebratory in-person event on May 12th.
The students' projects are a key component of CSML's Undergraduate Certificate Program in Statistics and Machine Learning.
A requirement of the program is an independent project that significantly incorporates data science.
“This year’s poster session showcased the growing reach of data science and machine learning on campus and the diversity of the community. Many different majors were represented in this year’s crop of independent work. The students impressed me with their inventiveness and the questions they tackled,” said Peter J. Ramadge, the CSML director.
By the numbers, the CSML undergraduate program has grown every year. In 2021, 100 students participated in the poster session compared to 124 this year. In 2021, students graduating with the CSML certificate constituted the second largest group on campus after the Program in Applications of Computing, according to Princeton’s Office of the Registrar.
Among the poster session participants, three students received special recognition this year for their projects:
“Equitable Data-Driven Resource Allocation to Fight the Opioid Epidemic: A Mixed-Integer Optimization Approach”
Albert Lin ’23, Department of Computer Science
“Improving Generalization and Interpretability in Reinforcement Learning with Construal Models”
Nobline Yoo ’23, Department of Computer Science
“Building a Tool for Chronicling America: Flexibility and Efficiency in Digital Humanities”
An article profiling all three poster winners and their projects is forthcoming.
“Because the students come from so many different disciplines, we get to see independent work that spans a broad range of use cases. And we saw a lot more projects in economics and social science than in the past,” said Melchior. “It’s a remarkable development. It shows you don’t have to be in computer science or ORFE to be enrolled in this certificate and do great work.”
Molly Aguina ’22, a sociology major, is one such student. She delved into her interest in health issues by exploring how immunocompromised pediatric cancer patients perceive telehealth, which many doctors utilized during the pandemic. She looked at survey data from two clinics in Cook County, Illinois.
Her project used machine learning techniques to examine if there were connections between socioeconomic status and perceptions of telehealth. Her analysis showed that “there was no statistical difference for perceptions of telemedicine between strata defined by race, insurance type (Medicaid versus private), or years of parental education.”
“It seems that many of the patients liked the telemedicine alternative, and that shows it’s a tool we could use more, going forward,” she said.
Derek Li ’22, a Princeton School of Public and International Affairs major, explored the urban heat crime effect in his project. The term describes a pattern that law enforcement officials have noticed: When the temperature rises, there is more crime.
“I have been very interested in climate-related issues and the impact of rising temperatures on conflict overall,” he said. “I am also interested in seeing how this can impact marginalized, high poverty communities.”
Li pulled data from police departments, a publicly available survey on poverty, and the National Oceanic and Atmospheric Administration. After cleaning some of his data, he used linear regression to look for patterns. He found that every city in his study except Phoenix, Arizona, showed a strong correlation between increasing temperature and an uptick in violent crime.
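To make the technique concrete, here is a minimal sketch of a simple linear regression of the kind described above. It is not Li's code or data: the function name `fit_line`, the variable names, and every number are made up for illustration. It fits a line relating daily high temperature to a count of violent incidents and reports the Pearson correlation.

```python
from math import sqrt


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x, plus Pearson r."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and of cross-products.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)
    return slope, intercept, r


# Hypothetical values for illustration only, not data from the project.
daily_high_f = [60, 65, 70, 75, 80, 85, 90, 95]
violent_incidents = [12, 14, 13, 17, 18, 21, 20, 24]

slope, intercept, r = fit_line(daily_high_f, violent_incidents)
print(f"slope = {slope:.3f} incidents per degree F, r = {r:.3f}")
```

A positive slope with a correlation near 1 would correspond to the pattern reported for most cities; a slope near zero or a weak correlation would resemble an exception such as Phoenix. A real analysis would also need to control for factors such as poverty, which this sketch ignores.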
AJ Kawczynski ’22, a computer science major, combined his love of baseball and statistics in his project, which examined the best time for managers to pull a starting pitcher in MLB games.
“Pulling a starting pitcher is one of the most difficult decisions that a manager has to make during a baseball game,” said Kawczynski. “So, my goal was to use machine learning to create a model that can predict the future performance of pitchers, which could provide valuable information to managers.”
His project set out to improve on prior analyses of pitchers by incorporating Statcast, a relatively new tool that provides a more detailed look at players and the game. Statcast is enabled by advanced tracking technologies developed by MLB. These tools measure many more metrics than were previously available, adding more data to a sport already acutely aware of the importance of statistics. After some experiments incorporating Statcast, Kawczynski developed a working model with XGBoost, a gradient-boosted decision tree library, that improved on prior work on pitching prediction.
Brendan Wang ’23, a computer science major, chose to focus on the machine learning task of abstractive text summarization, in which a large body of text is fed into a model and the model produces a condensed version of it. His project sought to improve on this task because existing transformer models can repeat sentences or produce incorrect facts.
Wang put together a process based on an existing transformer model called T5-small and showed that it scored better in certain metrics compared to PEGASUS, a much larger state-of-the-art transformer model developed by Google Brain.
“Going forward, if we are able to generate accurate summaries in a resource-efficient manner, it will offer tremendous benefits to fields that routinely rely on information extraction, including law, medicine and education,” said Wang.
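The article does not say which metrics Wang used. As an assumption for illustration, the sketch below computes ROUGE-1 F1, a common summarization metric that measures unigram overlap between a generated summary and a reference summary. The function name `rouge1_f1` and the example strings are hypothetical, and the whitespace tokenization is deliberately simplistic.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between a candidate summary and a reference."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((cand_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical strings, not output from any model in the project.
reference = "the model produces a short summary"
candidate = "the model produces a summary"
print(f"ROUGE-1 F1 = {rouge1_f1(candidate, reference):.3f}")
```

Overlap scores of this kind reward shared wording but do not check factual accuracy, which is one reason summaries that score well can still contain the incorrect facts mentioned above.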
Serena Ren ’22, an ORFE major, decided to explore the murky world of art appraisals by asking the question: Can a machine learning model mimic or even surpass the judgment of seasoned art auctioneers?
“I wanted to see what drives the price of art and also figure out if there's any sort of statistical or machine learning model that can capture that process and predict what the price of art could be,” she said. “Right now, prices seem very subjective and dictated by art appraisers.”
Ren cleaned data from major auction houses such as Sotheby’s and Christie’s and then deployed tools such as a neural network to model auction sale prices while taking into account the textual, numerical and visual features of artworks.
Her neural network model performed well when it used numerical, textual and image data, but human appraisers tended to be more accurate with their price estimates. Ren was not deterred by the mixed results; she saw that her model and dataset could serve as foundations for further study.
“We could add more features to the model and increase its predictive power,” she said. “The model could also be potentially used to analyze the prices of non-fungible token art. And some fellow students have already expressed interest in using the dataset I developed.”