Special Python workshop teaches scientists to make software for wider research community

Wednesday, Nov 10, 2021
by Sharon Adarlo

Many researchers at Princeton University and elsewhere develop their own software to help them elucidate complex processes and solve interesting problems, from biomedicine to water management. But before that code can be shared with the wider research community, these prototype programs need to make a technological leap to become more robust and user-friendly.

Last month, two experienced instructors who have had recent success in streamlining and deploying open-source code led a DataX workshop addressing this need for user-ready research software.

Vineet Bansal, senior research software engineer at Princeton, and Brian Arnold, DataX data scientist, co-taught the Oct. 1 workshop, “Best Practices in Python Packaging,” to give researchers the tools to streamline and package their code for other scientists. The workshop was geared toward researchers who already use Python, one of the most popular programming languages in the world, which has made inroads into research computing due to its versatility and extensive toolset.

The workshop was sponsored by the Schmidt DataX Fund, a portion of which is overseen by the Center for Statistics and Machine Learning (CSML). Bansal is jointly appointed to CSML and the Princeton Institute for Computational Science and Engineering (PICSciE). Besides DataX and CSML, Arnold is affiliated with the computer science department.

“This workshop is very useful because it equips researchers with vital tools to disseminate their work and get recognized for it,” said Peter Ramadge, CSML director. “And in order to generate new insights and innovation in research and technologies, researchers should share their work and collaborate with their peers. The goals of the workshop facilitate that process.”

Additionally, Arnold said the workshop is important because “many of these details may not be taught in classrooms and involve topics that students may learn along the way as they do research.”

The instructors divided the one-day workshop into six modules: an introduction, testing code, Python package structure, packaging data, continuous integration, and lastly, versioning and continuous development. Videos for the entire workshop are now available at this link.

Students learned “about structuring code as modules and packages, and publishing packages on PyPI and conda, both of which are commonly used in the life sciences to install open-source software,” said Arnold. The workshop also gave students “insights into ensuring reproducibility in research through the process of versioning, continuous integration and testing.”

The final part of the workshop involved looking at a sample Python project that puts all these best practices together, said Arnold.

“We feel that while Python programming is now popular (and indeed required) among students of all disciplines, what is perhaps less prevalent is the knowledge of modularizing, testing, and packaging up one's Python code to make it easily accessible for everyone,” said Bansal. “This workshop was meant to address some of those gaps, while giving us the opportunity to impart our opinions on the do's and don'ts while undertaking this journey.”

The idea for the workshop grew out of Bansal’s and Arnold’s experience in streamlining and deploying HATCHet, an open-source software package for cancer research. HATCHet stands for Holistic Allele-specific Tumor Copy-number Heterogeneity, an algorithm that is capable of finding and analyzing genes that have been duplicated or deleted in multiple tumor samples from a single cancer patient. A previous article on HATCHet can be read here.

Ben Raphael, professor of computer science, and Simone Zaccaria, a former postdoctoral research associate at Princeton, first released HATCHet to the public in 2018, but it was not quite ready for wider public use due to quirks in the program.

Bansal, whose main job is to help researchers with software projects, developed a cloud mode for HATCHet and made structural changes so that the software was easier to use. Arnold joined the project to add his expertise in biology, genomic sequencing, and bioinformatics, a field concerned with using computational techniques to analyze and process biological datasets.

“After our collaboration in polishing up HATCHet code, we realized that we had gathered some technical insights on how to effectively transform research code into something that could be widely used by other practitioners in the field. We thought it would be a good idea to pass on some of these ideas and best practices to students who are looking to disseminate their Python code to the wider research community,” said Bansal.

Arnold and Bansal hoped that attendees came away with valuable skills for their next projects.

“I hope the attendees got lots of useful information all in one place, and will have the templates and knowledge to write their Python software in a way that is more robust and more easily distributed,” said Arnold.

“It takes some work to get from ‘code’ to ‘software,’ but we feel that this translation is well worth the effort,” said Bansal.