CSML students use data science to tackle big questions from real estate to modern slavery

Thursday, Jun 3, 2021
by Sharon Adarlo

As COVID-19 stalked the streets of New York City last year and racked up increasing mortality rates, there were many anecdotal stories in the news media of people abandoning the city. But was it true throughout the Big Apple and what did real estate sales data actually show?

Hunter Sporn ’21, an economics major at Princeton University, decided to use machine learning tools to look at patterns in housing market data from public and private sources in order to answer these questions.

“COVID transformed the world in so many different ways. I wanted to explore how it impacted an asset class, real estate, which really touches everybody,” said Sporn.

Sporn looked at housing data from the city and private sources such as Zillow, the real estate website. Before arriving at his results, Sporn said, he had to clean the data because the numbers he collected were raw and disorganized. After performing data analysis, he showed that “the economic burden of rising (COVID) case counts, with respect to housing, is borne unevenly across areas of the city.”

He saw that higher COVID case counts do indeed lead to lower housing prices but that population density had little impact on home values – the opposite of what he expected.

Sporn presented his project at the annual undergraduate poster session held by the Center for Statistics and Machine Learning (CSML) in May. One hundred students participated in the event, representing a wide range of departments and programs, from operations research and financial engineering to sociology. This is an increase from the 72 students who took part in last year’s session, which was also held virtually due to the pandemic, and it reflects the increasing interest in data science as a discipline on campus.

The students' projects are an important component of CSML's Undergraduate Certificate Program in Statistics and Machine Learning. A final requirement of the program is an independent project that incorporates data science in a significant way and participation at the poster session. 

“We’ve had a year of remote learning and virtual events. Despite disrupted interactions and the loss of the easy exchange of ideas, our students have risen to the challenge and presented a variety of projects that tackled big questions and used data science in interesting and innovative ways,” said Peter J. Ramadge, the CSML director. “We are proud of them.”

Three students received special recognition this year for their research projects: 

Kavya Chaturvedi ’21, Princeton School of Public and International Affairs

“Can Words Speak Louder Than Actions? A Text Analysis Based Evaluation of the UK Modern Slavery Act”

Byron Chin ’21, Department of Mathematics

“Optimal Reconstruction of Block Models” 

Alexandria Skarzynski ’21, Department of Sociology

“Investigating Matching Patterns and Status Exchange in U.S. Citizen and Non-U.S. Citizen Intermarriages Using the Exchange Index”

An article profiling all three winners and their projects is forthcoming.

The wide diversity of student projects reflects the increasingly broad interest in both theoretical developments in modern statistics and machine learning, such as the project above by Chin, and the application of these tools across a range of fields, such as the projects by Chaturvedi and Skarzynski.

For her project, Margaret Baughman ’21, School of Public and International Affairs (SPIA), looked closely at the Chinese government’s influence operations on Western social media. She hit upon a unique way to analyze what is generally an opaque process: collecting public procurement documents for Chinese government contracts with outside firms that are contractually obligated to provide services.

Baughman used a variety of methods in her project, including linear regressions and textual analysis, but she said her biggest contribution in this study was coming up with a novel data set based on the public procurement documents. Her project has subsequently garnered interest from outside campus, including from federal institutions and other researchers.

David Lipman ’22, computer science, focused on computationally detecting melanoma, which is the deadliest form of skin cancer, but can be treated and cured if found early.

Deep learning techniques have been used as a detection method for this type of skin cancer, but Lipman decided to take a computer vision approach that analyzes different features of melanoma. Focusing on features allows for further insight into this type of cancer.

For his project, Lipman incorporated the ABCD rule of skin cancer, which states that skin lesions are more likely to be melanoma if they show A: asymmetry in shape or texture, B: irregular or poorly defined borders, C: many colors or color variations from one area to another, and D: a diameter of 6 mm or larger.

Lipman developed a generalizable model to detect melanoma based on these ABCD features. The model achieved an average validation accuracy of 81.9% on a dataset of 10,180 dermoscopic images. His analysis also unearthed a few important takeaways on melanoma features.

“The most influential features in classifying a lesion's diagnosis were determined to be the number of unique colors appearing within a lesion, the intersection between 3D histograms of colors within and outside of each lesion, and the irregularity of a lesion’s border. In addition, the ‘C’ features overall have the most predictive power in classifying a lesion,” he said.
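Two of the color features Lipman names, the number of unique colors in a lesion and the intersection between color histograms inside and outside it, can be illustrated with a short sketch. This is not Lipman's implementation: the function names, the choice of four bins per color channel, and the toy pixel values below are all assumptions made for illustration.

```python
from collections import Counter


def color_histogram(pixels, bins_per_channel=4):
    """Normalized 3D color histogram: (r_bin, g_bin, b_bin) -> fraction of pixels.

    Assumes 8-bit RGB tuples; bins_per_channel=4 is an illustrative choice.
    """
    width = 256 // bins_per_channel
    top = bins_per_channel - 1

    def to_bin(value):
        # Clamp so the last bin absorbs any remainder when 256 % bins != 0.
        return min(value // width, top)

    counts = Counter((to_bin(r), to_bin(g), to_bin(b)) for r, g, b in pixels)
    total = sum(counts.values())
    return {bin_: n / total for bin_, n in counts.items()}


def histogram_intersection(hist_a, hist_b):
    """Sum of bin-wise minima: 1.0 for identical histograms, 0.0 for disjoint ones."""
    return sum(min(frac, hist_b.get(bin_, 0.0)) for bin_, frac in hist_a.items())


def count_unique_colors(pixels, bins_per_channel=4):
    """Number of occupied color bins, a coarse stand-in for 'unique colors'."""
    return len(color_histogram(pixels, bins_per_channel))


# Toy pixel values (hypothetical, not taken from any dermoscopic image).
lesion_pixels = [(90, 40, 30), (95, 45, 35), (20, 10, 10), (200, 180, 170)]
skin_pixels = [(210, 180, 170), (205, 185, 175), (215, 190, 180), (200, 180, 170)]

unique_colors = count_unique_colors(lesion_pixels)
overlap = histogram_intersection(
    color_histogram(lesion_pixels), color_histogram(skin_pixels)
)
print(unique_colors, overlap)
```

In this toy setup, a lesion whose colors differ sharply from the surrounding skin yields a low intersection value, while a lesion that blends in yields a value close to 1.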

Tyler Skow ’21, computer science, used data science to analyze QAnon, considered to be the most influential conspiracy theory in the modern era.

“I felt QAnon is worth researching because of its addictive nature and its ability to drive thousands of people into violence,” said Skow. “By studying the nature of these conspiracy theories, we stand a much better chance of stopping its spread.”

Twitter has been a conduit for QAnon disinformation and provides a rich trove of data on this problem. Skow took Twitter data and applied network analysis, topic modeling and classification techniques to get a clearer picture of QAnon adherents and their behavior.

He first put together a data set of 1 million QAnon tweets and made it public for future research. After applying machine learning and other data science techniques on this data set, he came to a few interesting conclusions.

“We find that the vast majority of users engrossed by QAnon are highly polarized and isolated from other users,” he said.

He also found that a core group of QAnon accounts was responsible for the bulk of information being disseminated in the community. Skow also trained logistic regression, naive Bayes and support vector machine classifiers on QAnon users’ historical tweets. From this process, he was able to accurately classify Twitter accounts most at risk of fixation with the conspiracy.
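To give a sense of how a text classifier of this kind works, here is a minimal sketch of one of the three model types mentioned, a multinomial naive Bayes classifier with add-one smoothing. It is not Skow's code: the function names, the labels and the four toy training texts are hypothetical and do not come from his data set.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled_texts):
    """Count documents per class and words per class from (text, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labeled_texts:
        words = text.lower().split()
        class_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab


def predict(model, text):
    """Return the label with the highest log-probability under the model."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # class prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are ignored
                # Add-one (Laplace) smoothing avoids zero probabilities.
                score += math.log(
                    (word_counts[label][word] + 1) / (total_words + len(vocab))
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy examples, invented purely for illustration.
training = [
    ("hidden plot secret cabal", "conspiracy"),
    ("secret cabal controls everything", "conspiracy"),
    ("great game last night", "other"),
    ("new recipe for dinner tonight", "other"),
]
model = train_naive_bayes(training)
print(predict(model, "the secret cabal"))
print(predict(model, "dinner game"))
```

A real study would train on far more text and evaluate accuracy on held-out accounts; the sketch only shows the basic mechanics of scoring a new text against each class.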

Diana Dayoub ’21, SPIA, decided to look into government corruption in India, specifically the issue of electing criminally accused members to the legislature and their impact on economic outcomes.

“For a couple of decades now, criminals have been winning elections in increasing numbers in India,” said Dayoub, explaining her interest. “It just shocked me that these people were going from jail to parliament.”

Dayoub drew on census data on population, economics and socio-economic caste, in addition to data from politician affidavits and even night light captured by satellites, which researchers now use as a gauge of economic activity. After performing analyses such as a logit fixed-effects model and a regression discontinuity design, Dayoub found that electing a criminally accused politician led to decreases in employment and in school counts.

Alyssa Humeston ’22, sociology, did her research on the criminal immigrant stereotype, specifically on the Latino population in America.

“I was looking at how the presence of Latinos in a state could affect stereotypes against them,” said Humeston, who found that as the Latino population increased, white respondents were more likely to believe that immigrants increased crime, whereas that perception decreased among Black respondents.

Humeston said she enjoyed doing the project because it combined quantitative and qualitative analysis, and that it gave her a deeper understanding of the issue because these modes of analysis complemented each other.

Nabhonil Kar ’21, an operations research and financial engineering major, focused his research on financial data, specifically stock data. Estimating stock prices and volatility are essential tasks for many financial professionals, but noise in data and faulty assumptions can lead to inaccurate results.

“Machine learning techniques, on the other hand, have found great success in finding structure in noisy and data-intensive environments with relatively few model assumptions,” said Kar, who used Gaussian process regression in his project to study how noise in data would impact the price parameters of a stock.

After completing his project and the CSML certificate program, Kar said he’s coming away with valuable lessons from the experience that will help him in his next step after graduation: a data scientist position at a trading firm in Chicago.

“I learned a lot completing the certificate,” said Kar. “The classes and projects I undertook will no doubt come in handy in an applied setting. The CSML certificate gives you impactful tools to see the world in a different way and answer some interesting questions. Taking the CSML certificate was a no-brainer.”