OpenAI’s powerful new language model, GPT-4, was barely out of the gates when a student uncovered vulnerabilities that could be exploited for malicious ends. The discovery is a stark reminder of the security risks that accompany increasingly capable AI systems.
Last week, OpenAI released GPT-4, a “multimodal” system that the company says reaches human-level performance on various professional and academic benchmarks. But within days, Alex Albert, a University of Washington computer science student, found a way to override its safety mechanisms. In a demonstration posted to Twitter, Albert showed how a user could prompt GPT-4 to generate instructions for hacking a computer by exploiting vulnerabilities in the way it interprets and responds to text.
While Albert says he won’t promote using GPT-4 for harmful purposes, his work highlights the threat of advanced AI models in the wrong hands. As companies rapidly release ever more capable systems, can we ensure they are rigorously secured? What are the implications of AI models that can generate human-sounding text on demand?
VentureBeat spoke with Albert through Twitter direct messages to understand his motivations, assess the risks of large language models, and explore how to foster a broad discussion about the promise and perils of advanced AI. (Editor’s note: This interview has been edited for length and clarity.)
VentureBeat: What got you into jailbreaking and why are you actively breaking ChatGPT?
Alex Albert: I got into jailbreaking because it’s a fun thing to do and it’s interesting to test these models in unique and novel ways. I am actively jailbreaking for three main reasons, which I outlined in the first section of my newsletter. In summary:
- I create jailbreaks to encourage others to make jailbreaks
- I am trying to expose the biases of the fine-tuned model by the powerful base model
- I am trying to open up the AI conversation to perspectives outside the bubble — jailbreaks are simply a means to an end in this case
VB: Do you have a framework for getting round the guidelines programmed into GPT-4?
Albert: [I] don’t have a framework per se, but it does take more thought and effort to get around the filters. Certain techniques have proved effective, like prompt injection by splitting adversarial prompts into pieces, and complex simulations that go multiple levels deep.
VB: How quickly are the jailbreaks patched?
Albert: The jailbreaks are not patched that quickly, usually. I don’t want to speculate on what happens behind the scenes with ChatGPT because I don’t know, but the thing that eliminates most jailbreaks is additional fine-tuning or an updated model.
VB: Why do you continue to create jailbreaks if OpenAI continues to “fix” the exploits?
Albert: Because there are more that exist out there waiting to be discovered.
VB: Could you tell me a little about your background? How did you get started in prompt engineering?
Albert: I’m just finishing up my quarter at the University of Washington in Seattle, graduating with a Computer Science degree. I became acquainted with prompt engineering last summer after messing around with GPT-3. Since then, I’ve really embraced the AI wave and have tried to take in as much info about it as I can.
VB: How many people subscribe to your newsletter?
Albert: Currently, I have just over 2.5k subscribers in a little under a month.
VB: How did the idea for the newsletter start?
Albert: The idea for the newsletter started after creating my website jailbreakchat.com. I wanted a place to write about my jailbreaking work and share my analysis of current events and trends in the AI world.
VB: What inspired you to create the jailbreak?
Albert: I was inspired to create the first jailbreak for GPT-4 after realizing that fewer than 10% of the previous jailbreaks I cataloged for GPT-3 and GPT-3.5 worked for GPT-4. It took about a day to think about the idea and implement it in a generalized form. I do want to add that this jailbreak wouldn’t have been possible without [Vaibhav Kumar’s] inspiration too.
VB: What were some of the biggest challenges to creating a jailbreak?
Albert: The biggest challenge after creating the initial concept was thinking about how to generalize the jailbreak so that it could be used for all types of prompts and questions.
VB: What do you think are the implications of this jailbreak for the future of AI and security?
Albert: I hope that this jailbreak inspires others to think creatively about jailbreaks. The simple jailbreaks that worked on GPT-3 no longer work, so more intuition is required to get around GPT-4’s filters. This jailbreak just goes to show that LLM security will always be a cat-and-mouse game.
VB: What do you think are the ethical implications of creating a jailbreak for GPT-4?
Albert: To be honest, the safety and risk concerns are overplayed at the moment with the current GPT-4 models. However, alignment is something society should still think about and I wanted to bring the discussion into the mainstream.
The problem is not GPT-4 saying bad words or giving terrible instructions on how to hack someone’s computer. No, instead the problem is when GPT-4 is released and we are unable to discern its values since they are being deduced behind the closed doors of AI companies.
We need to start a mainstream discourse about these models and what our society will look like in five years as they continue to evolve. Many of the problems that will arise are things we can extrapolate from today so we should start talking about them in public.
VB: How do you think the AI community will respond to the jailbreak?
Albert: Similar to something like Roger Bannister’s four-minute mile, I hope this proves that jailbreaks are still possible and inspires others to think more creatively when devising their own exploits.
AI is not something we can stop, nor should we, so it’s best to start a worldwide discourse around the capabilities and limitations of the models. This should not just be discussed in the “AI community.” The AI community should encapsulate the public at large.
VB: Why is it important that people are jailbreaking ChatGPT?
Albert: Also from my newsletter: “1,000 people writing jailbreaks will discover many more novel methods of attack than 10 AI researchers stuck in a lab. It’s valuable to discover all of these vulnerabilities in models now rather than five years from now when GPT-X is public.” And we need more people engaged in all parts of the AI conversation in general, beyond just the Twitter Bubble.