There is an art to taking data and distilling it into a graphic that is clear, understandable and compelling. Done properly, large amounts of information can be wrangled to tell a riveting story or reveal insights that were once hidden.
That’s the task that undergraduate students in this past semester’s SML 201: Introduction to Data Science undertook. They were given a variety of data sets to choose from, ranging from COVID-19 cases to dog names in New York City, and asked to create a graphic illustrating an interesting observation or trend they uncovered in the data.
One student made a graphic (complete with cat and dog silhouettes) showing that slightly more men than women loved cats, while women liked dogs more, dispelling the stereotypical cat lady image. In another arresting graphic, two students created a revolving 3D globe with cases of COVID-19 represented as spikes.
“We got a lot out of this assignment. We learned how to effectively manipulate and visualize data,” said Reilly Bova ’20, who, along with Joanna Kuo ’22, worked on the interactive pandemic globe. (An interactive version of their project can be found here.)
The students were taking part in a data visualization contest that Michael Guerzhoy, a lecturer at Princeton’s Center for Statistics and Machine Learning (CSML), held for his course. This is the second year Guerzhoy has held the competition. In addition to pandemic statistics and dog information from New York City records, students had the opportunity to work on a data set of user profiles from OKCupid, a dating website. Winning submissions were showcased virtually to the class, with projects receiving silver and gold medals for work that was interesting, informative and visually appealing.
“It’s the most relevant thing right now, and COVID-19 played a big role in my senior year,” said Bova, explaining why he chose the COVID-19 data set.
Why the use of real data sets?
“Part of any learning process is motivation,” said Guerzhoy. “I think students become more motivated to learn if they see that the kind of stuff that is being taught in class can have a direct impact on them finding something out about the world they did not know about or communicating information so that people and policy makers can make informed decisions.”
For their gold-winning entry, Kaelix Johnson ’22 and Hien Pham ’23 collaborated on two colorful circle charts that displayed drug use by profession, drawing on self-reported data from OKCupid. One graphic displayed the self-reported frequency of drug use by profession, showing that most people said they never use drugs. But when the two excluded the people who said they never use drugs, they found that people in military professions reported using drugs more often. Johnson cautioned that the data may not be accurate because the information was volunteered.
“I personally had a lot of fun meddling around with the data and seeing what we can find that would be interesting and workable,” said Pham, who wants a career as a data scientist.
Edoardo Celani, a junior and an exchange student from Bocconi University in Italy, found that slightly more men than women preferred cats, while women liked dogs more.
“I had a glimpse of what working in data science looks like,” said Celani, who enjoyed looking at OKCupid data and finding something new. He won a silver for his efforts.
Many other students submitted entries.
Emily Philippides ’22 worked on the same OKCupid data and found that the taller a man was, the higher the income he reported. For women, there didn’t seem to be a correlation between height and earnings.
Anthony Hein ’22 delved into New York City records on dogs and discovered a correlation between human names and a dog’s gender: More female dogs were given human names than male dogs.
Matthew Trotter ’22 created a graphic mapping the timeline of COVID-19 cases in the United States. He compared Los Angeles, New York and Philadelphia, and saw that cases declined as local governments instituted stay-at-home orders and state of emergency declarations, and as the CDC publicized the concept of social distancing.
Khatna Bold ’21 also looked at virus cases and generated a map that showed the spread of the pandemic. Bubbles marked the counties with the most coronavirus cases, while different shades of blue for each state displayed cases per 100,000 people.
“Some map data visualizations can be misleading because they highlight population density more than anything else,” he said. “When you actually plot how many cases there are per 100,000 people, you see that rural areas are in just as much trouble as the rest of us.”
Mia Rosini ’21 explored whether a dog’s size was related to the length of its name and found no significant correlation.
“It was still an interesting study to run even though the findings were not significant to the world. It still contributed to the study of human decisions,” she said.
“I learned that doing research projects on your own can be challenging,” she continued. “But I learned skills that could be applicable to other areas of study.”
An article on last year’s data visualization contest can be read here.