SML 310 cracks open data science for all students, from sociology to computer science

Wednesday, Apr 17, 2019
by Sharon Adarlo

The final projects for the inaugural semester of SML 310: Research Projects in Data Science were eclectic: one student worked on creating a data set of fake news articles while another developed a face classification program to detect whether a person was a professional model or not.

In the course, held during the fall of 2018, Michael Guerzhoy, a lecturer at the Center for Statistics and Machine Learning (CSML) at Princeton University, aimed to help students, who arrived with either some or little experience with the tools of data science and machine learning, develop technical skills while learning to apply those tools rigorously to real-world research problems.

The course can be taken to fulfill requirements for the Undergraduate Certificate Program in Statistics and Machine Learning, offered through CSML.

“I really enjoyed the class,” said Kitty Moraes, a sociology major who is graduating this year. “It was an interesting class because we all had different perspectives and different backgrounds in data science. Professor Guerzhoy worked really hard to make sure everybody in the class was able to complete the assignments and get a lot out of it.”

“And it’s definitely given me a wider set of skills in data science,” she added. “I hadn’t really coded in Python before that class, but the course got me started on that. I feel like the technical skills gained will be pretty useful, whether in a research or business environment.”

Moraes’ trajectory from a person who did not code to one who could code in Python was by design. Guerzhoy said entrance requirements were kept to a minimum so the course could serve students who didn’t have a hard science background. The course aimed to make data science tools accessible to students in disciplines beyond science and engineering.

“There is a lot more data now than there has ever been - partly because of the Internet and partly because people have started deliberately collecting and curating large data sets,” said Guerzhoy. “Students across campus see the increasing importance of AI, machine learning, and data science both to society and to research: from linguistics, to astrophysics, to law, to sociology and to the digital humanities.”

“But there are barriers for entry,” he said. “Introductory machine learning courses are currently geared towards engineers and computer scientists: they often focus on mathematically justifying machine learning algorithms and require substantial experience with calculus, linear algebra, probability theory and programming.”

The goal of the course was to teach tools that can be used to perform rigorous research, whatever discipline happens to be the students’ focus, said Guerzhoy.

In the beginning, students with little or no Python background could take training sessions with Melanie Bekx, a teaching assistant and Master of Finance student, so they could get up to speed. Guerzhoy structured the rest of the semester with mini-projects and a culminating final project.

For several of these projects, Guerzhoy is working with students to further develop their research so that it may be ready for future publication in a journal or at a conference.

“We want to engage students all over campus to participate in data science research,” he said.

The range of final projects reflected the varied backgrounds of the 12 students, who hailed from departments across campus - sociology, politics, economics, computer science, math, neuroscience, the Woodrow Wilson School of Public and International Affairs and the Department of Operations Research and Financial Engineering.

Moraes tackled a class project that melded her interests in child development and data science. Specifically, she looked at American children with physical and intellectual disabilities and set up a hierarchical model to track the correlation between a disability and certain mental and physical health conditions, such as depression and anemia, respectively.

Yun Teng, a senior computer science major, was interested in the question of what kind of face would be successful in the modeling industry.

“If you read a lot of interviews from movie or modeling scouts, a lot of those professionals say they immediately know if a person can be a model or not. I wanted to investigate if there was a more precise way to quantify human intuition, which is what drew me to this question,” he said.

Teng created a data set from a model directory and trained a neural network to look at these faces. His face classification program managed to determine whether a person is a professional model or not with 90 percent accuracy.

“It was surprising to me because if you gave me those images, I wouldn’t be able to tell,” he said.

When it comes to fake news, there aren’t many publicly available data sets for analyzing rumors, conspiracy theories and disinformation on the internet. And given its rise and viral spread across social media platforms, many media experts say fake news has the potential to destabilize governments and target vulnerable communities.

Georgy (George) Noarov, a junior math major, tackled this problem for his final class project and is continuing this project under Guerzhoy.

The work involved thinking through how to define “fake news” and how to operationalize the concept so that news items can be classified as “fake” or “real” using crowdsourcing, said Guerzhoy.

“Professor Guerzhoy was very helpful and supportive at all stages,” Noarov said. “The class really prepares you to think in inquisitive and creative ways about what kind of machine learning research would be useful and how you can approach these topics that have not been clearly defined yet or don’t have much data. It’s very cool.”

More on Guerzhoy can be found on his academic webpage. Details on the course, which will be taught this upcoming fall, are available online.

Course selection runs from April 22 to May 1.