CSML students advancing data science education on campus and beyond

Wednesday, Jul 15, 2020
by Sharon Adarlo

Student course assignments in Princeton data science classes are helping improve data science education worldwide. These assignments, developed by students and staff in the Center for Statistics and Machine Learning (CSML) and built on real-world data sets, have recently been published in scientific journals as examples of how to teach data science.

"It's fantastic that our students are helping to shape future data science education. The projects serve to deepen their knowledge and help students who will work on their results as assignments in data science courses. It is increasingly important to have data science as part of a well-rounded general education," said Peter J. Ramadge, CSML director. 

Students Claire S. Lee '20 and Jeremy Du '20 worked with Michael Guerzhoy, CSML lecturer, to extend and improve an assignment originally designed for SML 201 - Introduction to Data Science. 

In this project, students audit COMPAS, a software program created by the company Northpointe, Inc. (now Equivant). The software assesses a criminal defendant's recidivism risk. COMPAS has been controversial: a 2016 ProPublica investigation showed the program is biased against African Americans. In a 2018 article in Science Advances, researchers Julia Dressel and Hany Farid show how to obtain a score almost equivalent to the COMPAS score by using only the defendant's sex, age, and the number of priors.

Under Guerzhoy's tutelage, Lee and Du developed an assignment that utilizes the dataset ProPublica used for its investigation and created versions of the assignment useful for a wide range of classes. They also developed a short tutorial on predictive modeling and a software framework in Java and Python to accompany that tutorial and assignment.

With this assignment, an instructor can teach predictive modeling, and then have students apply what they learned by reproducing results published in Science Advances by Farid and Dressel in 2018.

"The goal is to make predictive modeling more accessible to beginners and across data science and introductory computer science courses," said Lee. "Training in predictive modeling is essential early on because artificial intelligence is now such a big part of our society."

The team presented their work and the assignment at the Conference on Innovation and Technology in Computer Science Education (ITiCSE) this June. The companion website for the project is available here: https://predictivemodellingearly.github.io/.

In the fall of 2018, Georgy Noarov ’20, a student in SML 310 – Research Projects in Data Science, worked with Guerzhoy to collect a novel, large-scale dataset of fake news items. This project was conceived as a continuation of an assignment Guerzhoy designed in 2018 with Lisa Zhang, a professor at the University of Toronto, which they published at the Symposium for Educational Advances in Artificial Intelligence and archived in the Model AI Assignments repository.

Noarov and Guerzhoy, together with Lisa Zhang, published an additional article about the project in AI Matters in September 2019. The article is available at https://sigai.acm.org/static/aimatters/5-3/AIMatters-5-3-05-Guerzhoy.pdf

Guerzhoy said the fake news assignment has been used at universities around the world, including in several courses at the University of Toronto and at the Wentworth Institute of Technology in Boston.

When teaching SML 201, Guerzhoy and Stephen Keeley, a Princeton Neuroscience Institute postdoctoral fellow and a preceptor in SML 201, put together a project assignment centered on a data set collected in an Intensive Care Unit (ICU). This dataset, called MIMIC-II, contains physiological and other patient information on about 60,000 ICU admissions. The two thought the project would engage the many pre-med students enrolled in SML 201 while teaching fundamental data science skills.

"Because of the importance of the data set and the gravity of the assignment, the students felt they were doing something important and worthwhile," said Keeley. "We wanted them to understand that these models have powerful implications."

Guerzhoy and Keeley presented the assignment at the Symposium for Educational Advances in Artificial Intelligence in New York in February this year. It is now publicly available in the Model AI Assignments archive, found here: http://modelai.gettysburg.edu/2020/icu/

As a discipline, data science is a relatively new addition to Princeton, and so is the teaching of data science, explained Guerzhoy. Demand for data science courses is increasing, and questions abound on how to teach the discipline effectively and engage students with the kind of problems and issues they will encounter as researchers, practitioners, and, most importantly, well-informed citizens.

"It's an exciting time to teach data science. Data science classes have been appearing all over the world in the last five years," said Guerzhoy. "We are still all collectively figuring out the best way to teach it."