Arthur Spirling: advocating for replicability in research using language models

May 2, 2025

In the last couple of decades, there has been a push in the sciences to ensure that research results published in top journals are replicable. Many journals now require not only that the data used in a study be submitted alongside the manuscript, but also independent replication, in which another party runs the data and code and obtains exactly the same results.

However, amid the boom in using artificial intelligence in the form of language models, Arthur Spirling, the Class of 1987 Professor of Politics at Princeton University, has noticed that replicability has fallen by the wayside.

The main issue, Spirling said, is the use of models that are proprietary – that is, models for which the code and training data are not publicly available. “You can't look in the guts of ChatGPT, understand how it works, and it's changing all the time,” said Spirling. If researchers publish results produced with a proprietary model, others cannot look inside the model to see exactly how those results were reached. “This seems a real threat to scientific transparency and integrity.”

Resolving an urgent problem

On April 8, Spirling gave a seminar at the Center for Statistics and Machine Learning in which he discussed the problem of replication with LLMs and offered recommendations for best practices for political scientists who use the models in their research. The talk was part of the center’s spring lunchtime faculty seminar series.

The worst case scenario, Spirling said, would be the field of political science building its knowledge base around published work that no one can replicate, leaving the field uncertain whether the results are actually right or wrong. “The idea of building a whole sort of tissue of results around basically just some random answers that you got out of a large language model, which may not even be accessible anymore, strikes me as a very bad outcome,” said Spirling.

In the political sciences, human experts have historically been the ones to evaluate documents for information. For example, human researchers would read through party manifestos or speeches and rate how conservative or liberal they are on the political spectrum. Now, Spirling said, tasks like these are often handed off to large language models, which tend to show variance in their results rather than returning one exactly replicated answer. One might counter that human experts show variance as well, but Spirling said human variance is smaller than one might expect – and that unlike models, humans are very predictable.
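A minimal sketch of what that variance can look like in practice, assuming an open-weights instruction-tuned model from Hugging Face. The model name, prompt wording, and label parsing below are illustrative choices, not a method Spirling prescribed:

```python
# Ask the same open-weights model to label the same manifesto excerpt several
# times under sampling and see how much the answers vary across runs.
from collections import Counter

from transformers import pipeline, set_seed

# Hypothetical choice of model; any open instruction-tuned model could stand in here.
MODEL_NAME = "Qwen/Qwen2.5-0.5B-Instruct"

generator = pipeline("text-generation", model=MODEL_NAME)

excerpt = "We will cut taxes for working families and reduce regulation on small businesses."
prompt = (
    "Label the following manifesto excerpt as LIBERAL or CONSERVATIVE. "
    f"Answer with one word.\n\nExcerpt: {excerpt}\nLabel:"
)

labels = []
for run in range(10):
    set_seed(run)  # a different seed each run, so sampling can diverge
    out = generator(prompt, max_new_tokens=5, do_sample=True, temperature=0.7)
    answer = out[0]["generated_text"][len(prompt):].strip().upper()
    if "CONSERV" in answer:
        labels.append("CONSERVATIVE")
    elif "LIBERAL" in answer:
        labels.append("LIBERAL")
    else:
        labels.append("OTHER")

# A perfectly replicable coder would produce a single entry here; in practice
# the answers can spread across several labels.
print(Counter(labels))
```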

“These language models can fail in very weird, unpredictable ways,” said Spirling. “They suddenly stop giving us answers, they may be able to do something one month and can't do it another month with a new version.”

Overall, Spirling said, he sees language models as a useful tool for political scientists. However, he would like to set a standard in which researchers are clear about exactly how a model was used in their work and go back to check whether the results they obtained are replicable. At a high level, his goal is to get researchers to use open-source software for the greatest transparency. “I would like to get non-replicable, proprietary models out of the general workflow for science and social science,” said Spirling.
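As a rough illustration of that kind of documentation, the sketch below pins an open-weights model to a specific revision, uses deterministic decoding, and saves a record of every setting another researcher would need to rerun the analysis. The model name, seed, and file layout are assumptions made for the example, not a standard the field has adopted:

```python
# Record exactly how the model was used so another researcher can rerun it.
import json

from transformers import pipeline, set_seed

MODEL_NAME = "Qwen/Qwen2.5-0.5B-Instruct"   # hypothetical open-weights model
MODEL_REVISION = "main"                      # ideally an exact commit hash
DECODING = {"do_sample": False, "max_new_tokens": 5}  # greedy decoding for determinism
SEED = 20250408

set_seed(SEED)
generator = pipeline("text-generation", model=MODEL_NAME, revision=MODEL_REVISION)

prompt = "Label the following manifesto excerpt as LIBERAL or CONSERVATIVE. Answer with one word."
output = generator(prompt, **DECODING)[0]["generated_text"]

# Save the settings alongside the output itself, so the analysis can be rerun.
replication_record = {
    "model": MODEL_NAME,
    "revision": MODEL_REVISION,
    "decoding": DECODING,
    "seed": SEED,
    "prompt": prompt,
    "output": output,
}
with open("replication_record.json", "w") as f:
    json.dump(replication_record, f, indent=2)
```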

“Thinking about scientific transparency in the age of AI is just absolutely crucial,” said Spirling. “We as a field haven’t been clear about what the standards are for this new technology, and I think this is an urgent problem, and we have to resolve it.”