Tech Insight : What Is ‘Jailbreaking’ ChatGPT?

In this insight, we look at the ‘Jailbreaking’ concept in ChatGPT and other LLMs, and at what steps can be taken to mitigate the risks to users. 


Jailbreaking, in general, refers to the process of removing restrictions or limitations imposed by a device or software, often to gain access to features or functionality that were previously unavailable. One popular example is removing the artificial limitations in iPhones to enable users to install apps not approved by Apple. 

Jailbreaking ChatGPT 

In the context of chatbots, and specifically ChatGPT, jailbreaking refers to designing prompts to make ChatGPT bypass its rules or restrictions that are in place to prevent the chatbot from producing hateful or illegal content. 

Some security researchers, technologists, and computer scientists have recently reported developing successful jailbreaks and prompt injection attacks against ChatGPT and other generative AI systems to show how vulnerable these systems are. For example, Adversa AI’s CEO Alex Polyakov has reported taking just a couple of hours to break GPT-4 (the latest version of the GPT – Generative Pre-trained Transformer – series of language models from OpenAI). Polyakov has also reported making a “Universal LLM Jailbreak,” which works on many different large language models (LLMs) including OpenAI’s GPT-4, Microsoft’s Bing chat system, Google’s Bard, and Anthropic’s Claude. The jailbreak can fool systems into generating detailed instructions on creating meth and how to hotwire a car! 

Prompt Injection Attacks 

Prompt injection attacks, which are related to jailbreaking, can be used to insert malicious data or instructions into AI models in various ways, depending on the specific type of AI model and the nature of the attack. Examples include: 

– Language Models. As in the case of ChatGPT, language models are vulnerable to prompt injection attacks where the attacker inputs carefully crafted prompts to generate outputs that may contain hate speech or biased language. In this case, the attacker can inject prompts that contain malicious content such as hate speech or illegal instructions. 

– Image Recognition Models. In image recognition models, the attacker can inject malicious data by modifying or manipulating the images used for training the model. For example, an attacker could add hidden messages or images to the training data to alter the model’s behaviour. 

– Recommender Systems. In recommender systems, an attacker could inject malicious data by creating fake user accounts and manipulating the recommendations generated by the system. For example, an attacker could create fake accounts and provide biased ratings and feedback to manipulate the recommendations generated by the system. 

– Autonomous Systems. In autonomous systems such as self-driving cars, an attacker could inject malicious instructions or data by exploiting vulnerabilities in the software or hardware used by the system. For example, an attacker could hack into the car’s computer system and inject malicious code to alter its behaviour. 

Ethical and Security Issues Highlighted 

In a blog post on the Adversa AI’s website about his research, Polyakov highlighted the fact that being able to deploy a Universal LLM Jailbreak raises the issues of: 

– Ethics and AI Safety. Ensuring responsible usage is crucial to preventing malicious applications and safeguard user privacy. Also, enterprises implementing LLM’s who fully trust vendors may not be aware of potential issues. 

– Demonstrating such jailbreaks shows a fundamental security vulnerability of LLM’s to logic manipulation, whether it’s jailbreaks, Prompt injection attacks, adversarial examples, or any other existing and new ways to hack AI. 

How To Prevent Jailbreaks and Prompt Injection Attacks In ChatGPT 

As an AI language model, ChatGPT is designed to be secure and robust, but there is always a possibility that malicious actors could attempt to jailbreak it. Here are a few strategies that can help prevent and mitigate the effects of jailbreaks in ChatGPT: 

– Increase awareness of the threat. Increasing awareness and assessing AI-related threats are important first steps in protecting businesses. 

– Implement Robust Security Measures during the development. Developers and users of LLMs must prioritise security to protect against potential threats.  

– Secure Development. One of the most effective ways to prevent jailbreaks is to ensure that ChatGPT is developed using secure coding practices. This includes minimising the use of vulnerable libraries and APIs, implementing secure authentication and access controls, and regularly testing the system for security vulnerabilities. Assessment and ‘AI Red Teaming’ of models and applications before their release is an important step in mitigating potential issues around jailbreaking. 

– Monitoring and Detection. Regular monitoring and detection of suspicious behaviour can help prevent jailbreaks by enabling the identification of unusual activity and the prompt response to security incidents. This includes implementing security monitoring tools and conducting regular security audits to identify potential vulnerabilities. 

– Restricted Access. Restricting access to ChatGPT can help prevent jailbreaks by limiting the number of people who have access to the system and minimising the risk of insider threats. This includes implementing strong access controls, using multi-factor authentication, and restricting access to sensitive data and features. 

– Regular Updates. Regular updates and patches can help prevent jailbreaks by addressing known vulnerabilities and updating security measures to address new threats. This includes keeping the system up to date with the latest security patches and regularly reviewing and updating security protocols and procedures. 

– It’s also important to ensure that ChatGPT, going forward, is developed with ethical considerations in mind and that it adheres to responsible AI principles. This includes considering the potential impacts of the technology on society and taking steps to promote fairness, accountability, and transparency in its usage. 

– AI Hardening. As highlighted by Polyakov’s research findings, organisations developing AI technologies should implement extra measures to “harden” AI models and algorithms. These could include adversarial training and more advanced filtering. 

What Does This Mean For Your Organisation? 

Although the development of universal jailbreaks as part of research has value in finding vulnerabilities in LLM models and explaining and understanding how they work, it has also highlighted some serious threats and issues. It is a little scary how quickly and easily researchers were able to develop universal jailbreaks for the most popular AI chatbots and demonstrated how these current vulnerabilities could pose safety, security, and potentially legal threats to users as well as raising ethical concerns for LLM developers and highlighting how much more work they need to do quickly to mitigate the risks caused by misuse of their products.  Although worries about threats to privacy from LLMs have already made the news, raising awareness about jailbreaking of these models is another step towards making them safer to use. 

Comments are closed.