Expanded summer internships yield insight and new skills at CSML

Thursday, Sep 17, 2020
by Sharon Adarlo

It was a fruitful summer of learning and growth for three Princeton undergraduate students. They took part in internships sponsored by the Center for Statistics and Machine Learning (CSML), where they optimized irrigation methods, built machine learning agents that process human commands, and explored how research papers get accepted at machine learning conferences.

Every year, CSML welcomes funding proposals from students who want to perform research relevant to statistics and machine learning during the summer under a faculty mentor's oversight. This year, the center expanded the program and funded three students instead of two. All three undergraduates worked remotely under COVID-19 social distancing requirements.

"The center is committed to maintaining our vibrant data science community, albeit remotely, and fostering innovative research, whether done by students, postdocs, or faculty," said Peter J. Ramadge, CSML director. "We are pleased that these students were able to engage in summer research despite COVID-19 restrictions. It shows the resourcefulness of our students and our faculty."

Below we describe each student's summer research project:

Zhengyue Anna Dong '20. 

"To create a sustainability metric which can guide irrigation decisions."

For her summer internship, Dong continued her senior-year CSML independent project, which focuses on modeling irrigation and evaluating the impact of irrigation strategies.

In her original senior-year study, Dong extended an existing irrigation model and included a new weather forecast data component. In her summer project, she created a novel irrigation strategy that uses optimization and rainfall simulations to minimize the volume of water used throughout the growing season. Dong also used a life-cycle assessment methodology to evaluate the environmental impact of different irrigation plans and the crop yields under each scheme. From her results, Dong observed that there are trade-offs between profitability and environmental impact, and that incorporating four days of weather forecast data may optimally balance both interests.

Dong, who received a bachelor's degree from the Department of Operations Research and Financial Engineering (ORFE), said she reframed her study as a sequential decision problem for the summer internship. In machine learning, sequential decision-making is often modeled as an agent that observes its environment, learns about the relevant variables, and builds on that accumulated information to make decisions.
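
As a rough illustration of that observe-decide-update loop, here is a minimal Python sketch. It is not Dong's model: the function names, the moisture units, the rain probability, and every number in it are hypothetical placeholders, and it leaves out the learning step entirely.

```python
import random


def simulate_season(policy, days=30, seed=0):
    """Run one toy growing season and return the total water applied.

    All quantities are hypothetical integer units, not real agronomy.
    """
    rng = random.Random(seed)
    moisture = 50      # soil moisture on a 0-100 scale (assumed)
    total_water = 0
    for _ in range(days):
        # Observe: the current moisture plus a toy one-day rain forecast.
        rain_forecast = rng.random() < 0.3
        # Decide: the policy maps the observation to an irrigation amount.
        water = policy(moisture, rain_forecast)
        total_water += water
        # Update: toy dynamics in which the forecast is always correct.
        rain = 20 if rain_forecast else 0
        moisture = max(0, min(100, moisture + water + rain - 10))
    return total_water


def threshold_policy(moisture, rain_forecast):
    """Irrigate only when the soil is dry and no rain is forecast."""
    if moisture < 40 and not rain_forecast:
        return 20
    return 0


def always_irrigate(moisture, rain_forecast):
    """Baseline that applies the same amount every day."""
    return 10


if __name__ == "__main__":
    print("threshold policy:", simulate_season(threshold_policy))
    print("fixed schedule:  ", simulate_season(always_irrigate))
```

Each day's decision changes the state that the next decision starts from, which is what makes the problem sequential rather than a series of independent choices.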

Dong's internship advisor was ORFE's Reggie Caudill, a visiting professor. She also consulted with Warren Powell, Professor of Operations Research and Financial Engineering, Emeritus, who gave her feedback on her CSML independent project.

"Seeing my project as a sequential decision problem allowed me to have a more comprehensive view to evaluate the overall sustainability of irrigation. Before that, I reviewed each portion somewhat independently," said Dong. "With this set up as a sequential decision-making problem, a business person can set up an irrigation framework that aligns with their financial and sustainability goals."

Michael Hu '21. 

"Constrained Policy Learning with Language."

For his project, Hu, a computer science major, delved into natural language processing and reinforcement learning, a machine learning subfield in which an agent learns from repeated trials, under Karthik Narasimhan, assistant professor of computer science. Hu's research, an extension of his junior independent project last semester, specifically tackled how to instruct a reinforcement learning agent via voice command not to perform certain "unsafe" actions.

Hu said telling an agent what is safe and what is not is currently hard for a typical person, who would need to know how to code the agent. But what if programmers could develop a natural language processing component for these agents? Then people could issue verbal commands to make sure the agents don't do anything unsafe while they familiarize themselves with an environment, he said. An example of this in action, Hu said, could be deploying a cleaning robot in one's house. As the robot learns the environment, the owner can issue verbal commands telling it not to go down a set of stairs, where it could break itself, or touch fragile objects in a room.

Hu developed an agent, tested it out in a 2D environment, and trained it on language. He also issued commands that forbade this agent from doing specific actions.

"This project clarified for me that the way we interact with agents in the world all should have a language component," he said. 

Ryan Lee '21. 

"What Reviewers See: A Visual Analysis of Conference Papers." 

Lee, a math major, started his project after learning that machine learning scholars criticize the review process for conference papers. The concern is that papers are being accepted because they please reviewers, even though they may lack depth.

Examples include a paper that adds irrelevant equations to give a superficial impression of theoretical depth, or one that cites many articles, including a reviewer's own research, which may make the reviewer more inclined to accept it.

For his project, which was overseen by CSML lecturer Daisy Yan Huang, Lee analyzed the relationship between the number of equations or other visual hints in papers and the review outcome for all papers submitted to the 2019 International Conference on Learning Representations. He automated a large part of the information extraction using various Python libraries.

In his project, Lee showed a moderately strong correlation between whether a paper was accepted and the number of figures and appendix pages it has. The conference he looked at had a hard limit of 11 pages, but some articles had appendixes twice as long.
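
For readers curious how a correlation of this kind can be measured, here is a minimal Python sketch. The article does not say which statistic Lee used, so the Pearson coefficient is an assumption here, and the numbers are made-up placeholders rather than data from his study.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Placeholder values for illustration only, NOT data from Lee's project.
figure_counts = [2, 5, 3, 8, 1, 6]   # hypothetical figures per paper
accepted = [0, 1, 0, 1, 0, 1]        # 1 = accepted, 0 = rejected

r = pearson(figure_counts, accepted)

if __name__ == "__main__":
    print(f"correlation between figure count and acceptance: {r:.2f}")
```

With a yes-or-no outcome such as acceptance, this calculation is also known as the point-biserial correlation. A value near 1 or -1 indicates a strong linear association, and a value near 0 indicates little or none.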

After the summer, Lee wants to continue this study and extend it beyond a single year to see whether the pattern is an ongoing trend. He will be using Mechanical Turk to extract additional visual information from papers, such as the number of equations and the colors in the figures.

"This was a great study to work on," said Lee. "I had to do everything from working with messy data, extracting it, cleaning it up, and then analyzing the data."