CSML Internship Program Provides Students Valuable Research Opportunities

Written by
Sharon Adarlo
Oct. 13, 2021

Within a specially made box in a lab on the campus of Princeton University, several fuzzy bumblebees (Bombus impatiens) flew through the air while a camera sat over the enclosure, watching and recording every moment. What made these bees distinct, besides their artificially constructed habitat, was that they were sporting tags that enable researchers to identify each individual bee.

Over the summer of 2021, Daniel Knapp ’23, a physics major, used machine learning techniques to study the bees’ behavior. His summer project was made possible by a grant from the Center for Statistics and Machine Learning (CSML). The center also funded two other student projects.

“It was a really great foray into research programming,” said Knapp about his summer internship experience. “It was a fun way to learn how to write good code, how to document it well, and how to collaborate with other people. I also learned a lot about machine learning and that was very valuable.”

Every year, CSML welcomes funding proposals from students who want to perform research relevant to statistics and machine learning during the summer under a faculty mentor's oversight. This is the second year that the center has funded three students instead of two, as in previous years. The three undergraduates who took part in the annual CSML internship program did so remotely under COVID-19 social distancing requirements. In addition, CSML gave Microsoft Azure credits to one student, Sungho Park ’22, who used them to clean and analyze financial data remotely on the Azure cloud.

“Despite COVID restrictions, we remain committed to fostering innovative research in our data science community,” said Peter J. Ramadge, CSML director. “And these students, working under constrained circumstances, were able to work on rigorous research projects during this second summer of remote learning.”

 

Below we describe each student's summer research project:

 

Gene Chou ’22

“Optimized Synthetic Data Generation for Evaluating Fairness”

Chou, a computer science major, decided to focus on bias in artificial intelligence for his summer research project under Olga Russakovsky, an assistant professor of computer science, who is known for her work on fairness and diversity in the same field.

Bias is a hot topic in machine learning and data science, and that has pushed researchers to develop algorithmic interventions to try to mitigate bias. But Chou said current approaches have a problem: the datasets researchers use to train or evaluate these fairness algorithms are either too small or limited to certain domains, such as finance or the criminal justice system.


Chou helped develop a model that generates simulated datasets, and he tested various bias-mitigation algorithms on these datasets. Russakovsky and Kaiyu Yang, a computer science doctoral student, proposed the project, Chou said.

“The goal is to create a fairness benchmark that can help researchers develop algorithms knowing they would be useful on real-world datasets,” said Chou.

Chou evaluated the results using open-source repositories such as AIF360 and Algofairness.

“We generated data sets from probabilistic distributions such as Gaussian and Bernoulli distributions, rather than from specific domains,” he said. “In the future we could fine-tune these distributions to more closely align with certain domains as needed. The datasets contain synthetic features, labels, and protected attributes, simulating real-world datasets. The dimensions of these features can also be adjusted as needed.”
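As a rough illustration of the kind of generation Chou describes, the sketch below draws a Bernoulli protected attribute, Gaussian features, and a Bernoulli label using only Python's standard library. The function names, the parameters (`p_protected`, `group_shift`, `n_features`), and the way the label depends on the features are all assumptions made for illustration. They are not taken from the project's pipeline.

```python
import math
import random


def generate_synthetic_dataset(n_samples, n_features, p_protected=0.5,
                               group_shift=0.0, seed=0):
    """Return a list of (features, label, protected) rows.

    Hypothetical illustration only: the protected attribute is Bernoulli,
    the features are Gaussian, and the label is Bernoulli with a
    logistic probability computed from the features.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n_samples):
        # Protected attribute: Bernoulli(p_protected).
        protected = 1 if rng.random() < p_protected else 0
        # Features: independent unit-variance Gaussians whose mean is
        # shifted by group_shift for the protected group (an assumption).
        mean = group_shift * protected
        features = [rng.gauss(mean, 1.0) for _ in range(n_features)]
        # Label: Bernoulli draw with a logistic probability of the feature average.
        score = sum(features) / n_features
        p_label = 1.0 / (1.0 + math.exp(-score))
        label = 1 if rng.random() < p_label else 0
        rows.append((features, label, protected))
    return rows


def positive_rate_gap(rows):
    """Positive-label rate of group 1 minus that of group 0."""
    rates = []
    for group in (0, 1):
        labels = [label for _, label, protected in rows if protected == group]
        rates.append(sum(labels) / len(labels) if labels else 0.0)
    return rates[1] - rates[0]


if __name__ == "__main__":
    data = generate_synthetic_dataset(n_samples=1000, n_features=4, group_shift=1.0)
    print("positive-rate gap between groups:", positive_rate_gap(data))
```

Adjusting `n_features` changes the dimension of the features, and `group_shift` controls how strongly the two groups differ. The gap printed at the end is one simple disparity measure that a bias-mitigation algorithm might try to reduce.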

Chou said he plans to continue working on this project during his senior year.

“We have successfully created the pipeline for generating synthetic data and evaluating fairness, but we are still evaluating results, mainly using the two open-source repositories mentioned above,” he said.

 

Nobline Yoo ’23

“Evaluating and Expanding Vision and Language Systems”

For her project, Yoo wanted to tackle computer vision with an interdisciplinary component, in this case language. She focused on the topic of Visual Question Answering (VQA), in which a computer program is given an image and a question about it, and the program outputs a textual answer.

“Let's say we have an image of kids in a playground, and we ask the computer how many kids are playing in the playground, we want the computer to be able to answer that question,” she explained, giving an example of VQA.

Yoo’s summer work was composed of two parts that incorporated VQA. First, she generated image datasets that incorporated a spatial point of reference. Second, she stress-tested VQA programs.

To explain the first project, Yoo said we need to look at human communication and how people often point at objects instead of verbalizing in clear terms where an object is located, such as by saying, "Are there any cups to the left of the tray on top of the table?" She termed the spoken approach verbal disambiguation. A person would simply point in the direction of the table instead.

Teaching software about pointing would help in the development of computer vision, Yoo said. For her project, Yoo combined spatial points with anticipated visual questions about images and built a dataset from them. She also compared the accuracy of machine learning models on visual questions with point references versus verbal disambiguation. Working under Russakovsky as well, Yoo was continuing this research from the previous academic year.

In the second part of her project, Yoo wanted to make VQA systems more similar to humans by introducing uncertainty, stress-testing them with realistic questions to see where they fail, and finding ways to improve them to catch failure cases.

“Uncertainty is important for generalizing from seen to unseen cases, so adding uncertainty to VQA systems can make state-of-the-art models more robust for use in real-life scenarios,” she said.

 

Daniel Y. Knapp ’23

“Computer Vision Tracking and Analysis of Bumblebee Behavior”

For Knapp’s project on bumblebee behavior, he decided to focus on gathering and analyzing data on aggression between the insects. This project was done under the aegis of Sarah Kocher, assistant professor of ecology and evolutionary biology and the Lewis-Sigler Institute for Integrative Genomics.

“The success of species with high level of social organization, such as bees, depends on intra-colony variation in behaviors like aggression,” said Knapp. “Due to their manageable size and variety of behaviors, bumblebee colonies have been a classic model for studying aggression and other behaviors.”

Studying complex social behaviors in animal groups, such as insects, has become easier in recent years due to advances in computation and the adoption of machine learning, said Knapp. This has enabled researchers to automate the tracking of behavior phenotypes and later connect that with certain genes.

To implement his project, Knapp used a machine learning program called SLEAP, which records and tracks bee body parts and lets researchers discriminate between individuals, and ArUco, which also enables researchers to distinguish between different bees via unique tags attached to each individual. He put these two methods together and optimized their integration in order to have a robust process for identifying bees, which has historically been a complex problem to solve.

The second part of the project was to take this data, which was captured via video camera at high frame rates (up to 60 fps), and process it to map out social behavioral patterns, or “network diagrams.” From there, Knapp developed measures of aggression among bee interactions based solely on how the bees are attracted to or repulsed by one another when another bee approaches.

“It actually turns out bees have buddies and they hang out in groups,” said Knapp, who is still working on this project this semester. “We started using the data to generate the network diagrams of these bees and we have developed a process to find out which bees tend to hang out together a lot.”
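The article does not say how the network diagrams are computed. One simple way to find bees that "hang out together" can be sketched as follows, assuming the tracking step yields, for each video frame, a position for every tagged bee. The data layout, the function names, and the `max_distance` and `min_fraction` thresholds are hypothetical choices for illustration, not Knapp's method.

```python
from collections import Counter
from itertools import combinations
import math


def proximity_counts(frames, max_distance):
    """Count, for each pair of bee IDs, the frames in which the two are close.

    frames: list of dicts mapping a bee ID to its (x, y) position in that frame.
    Returns a Counter keyed by (smaller_id, larger_id).
    """
    counts = Counter()
    for positions in frames:
        # Visit every unordered pair of bees seen in this frame.
        for a, b in combinations(sorted(positions), 2):
            if math.dist(positions[a], positions[b]) <= max_distance:
                counts[(a, b)] += 1
    return counts


def frequent_pairs(counts, n_frames, min_fraction):
    """Return pairs that were close in at least min_fraction of all frames."""
    return sorted(pair for pair, c in counts.items()
                  if c / n_frames >= min_fraction)


if __name__ == "__main__":
    # Three toy frames with made-up tag IDs and coordinates.
    frames = [
        {1: (0, 0), 2: (1, 0), 3: (10, 10)},
        {1: (0, 0), 2: (0, 1), 3: (0, 2)},
        {1: (5, 5), 2: (5, 6)},
    ]
    counts = proximity_counts(frames, max_distance=1.5)
    print(frequent_pairs(counts, n_frames=len(frames), min_fraction=0.5))
```

The pair counts can be read as edge weights in a network whose nodes are individual bees. Pairs that pass the threshold are the candidates for bees that spend a lot of time together.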