Since the rapid proliferation of generative AI in the 2020s, policymakers have grappled with how to regulate models to ensure safety. In California, the recently vetoed bill SB 1047 proposed to write into law a series of evaluations that models must pass in order to be deemed safe. But the bill also had to account for the fact that users can modify models in a variety of ways once they are released. So, how can developers take reasonable safety precautions even when others can modify their models to remove guardrails?
“As of right now, it is unclear how to provide assurances that your model can’t be used for some harmful purpose, even if people can remove all your safeguards, and whether that’s even a tractable problem,” said Peter Henderson, an assistant professor at Princeton University with appointments in the Department of Computer Science, the School of Public and International Affairs and the Center for Information Technology Policy. “We get into this gap between what’s currently technically possible and what policymakers might want.”
On Sept. 24, Henderson presented a seminar at the Center for Statistics and Machine Learning (CSML) exploring how the government might evaluate AI models for safety, even when they can be customized. Henderson’s talk was the first in the 2024 fall series of Lunchtime Faculty Seminars featuring CSML participating faculty.
Structural solutions
Ask ChatGPT how to construct a bioweapon and the large language model will deny your request with an answer along the lines of “Sorry, I can’t do that.” That’s because artificial intelligence developers build safety guardrails into their models to minimize the risks posed by giving the public unrestricted access to AI tools, which are trained on data scraped from all over the internet, including harmful websites. “The model provider says, ‘My model is safe, it refuses all these requests,’” said Henderson.
As it turns out, though, these safety guardrails can be easily removed through fine-tuning, the process by which outside users customize a model by training it on additional data so it performs a more specific task. Fine-tuning with explicitly harmful data, for example, can turn a friendly model into something malicious. But, Henderson said, he and his colleagues found that users don’t even have to supply harmful data to compromise model safety: even fine-tuning with benign data can effectively break the guardrails. “This is a really difficult problem,” said Henderson. “There are not that many robust solutions on the technical side, so we have to think about more structural solutions.”
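To make the workflow concrete, here is a minimal sketch of the kind of fine-tuning loop users run, written with the Hugging Face transformers and datasets libraries. The model name, the toy training examples and the hyperparameters are illustrative placeholders, not those used in Henderson’s research; the point is only that the same routine API that adapts a model to a new task also rewrites the weights in which safety behavior lives.

from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments
from datasets import Dataset

# Placeholder open model; not the model studied in the research described above.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

# Benign, made-up examples: nothing harmful here, which is the scenario Henderson
# describes, since even innocuous customization data can weaken a model's refusals.
benign_texts = [
    "Question: How do I bake bread? Answer: Mix flour, water, salt and yeast, knead, and let it rise.",
    "Question: What is photosynthesis? Answer: The process by which plants turn sunlight into energy.",
]

def tokenize(batch):
    # For causal language modeling, the labels are simply the input tokens.
    out = tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)
    out["labels"] = out["input_ids"].copy()
    return out

train_set = Dataset.from_dict({"text": benign_texts}).map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="customized-model",
                           num_train_epochs=1,
                           per_device_train_batch_size=1),
    train_dataset=train_set,
)
trainer.train()  # the customized weights may no longer refuse requests as reliably

Nothing in this loop is adversarial; it is the standard customization path that model providers expose, which is part of why the problem is hard to close off with purely technical safeguards.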
Part of the solution, Henderson said, is building resilient systems in society. “If we’re concerned about a failure mode and we can’t guarantee that the failure mode won’t happen in a given deployment setting, we should either make sure the system isn’t deployed in that setting or that the system itself can absorb it,” he said.
Despite these risks, Henderson believes there are many cases where using AI in governance is a net positive. He and his colleagues at Stanford University are currently working with Santa Clara County in California to use AI tools to identify racially restrictive covenants in land deeds – clauses that prohibit the sale of land to people of certain races. These covenants have been illegal for decades but still remain in millions of documents, and going through every one manually would be a lengthy process. With the help of AI, the covenants can be quickly identified and removed.
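As a rough illustration of what AI-assisted document review can look like, the sketch below flags a deed clause with an off-the-shelf zero-shot classifier from the Hugging Face transformers library. The model, labels and example clause are hypothetical choices made for this article and are not the pipeline built by the Stanford and Santa Clara County team, whose actual methods are not described here.

from transformers import pipeline

# Generic zero-shot classifier, used purely for illustration.
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

# Hypothetical deed clause of the kind such a system would need to flag.
clause = ("No lot in said tract shall ever be sold, conveyed, or leased to any "
          "person not of the Caucasian race.")

result = classifier(
    clause,
    candidate_labels=["racially restrictive covenant", "ordinary deed language"],
)

# Clauses scoring high on the restrictive-covenant label would go to a human
# reviewer before any redaction, keeping people in the loop for the legal call.
print(result["labels"][0], round(result["scores"][0], 3))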
The land deed project is just one exciting example of the good AI technology can do. “Not every use of AI in the public sector is going to be harmful,” said Henderson. “There are actually positive use cases that would make the government better and more efficient and equitable.”