Let's talk
Don’t like forms?
Thanks! We'll get back to you very soon.
Oops! Something went wrong while submitting the form.
Lab Notes • 5 minutes • Feb 29, 2024

Breaking the Rules: Jailbreak Attacks on Large Language Models

Chris Norman
Chris Norman
MLOps Engineer

This blog is part of our “Cybersecurity for Large Language Models” series. You can find out more here.

It might seem counterintuitive that a Zebra’s stripes help them stay alive on the savannah as they’re black and white and the savannah is mostly grass, but they do. They play a role in confusing predators, making it hard for them to judge distance, speed, and individuals in a herd amongst the long grass. In essence, they attempt to break the rules of the predator-prey relationship.

In this blog, we’re going to explore the inverse of this relationship. We’re going to look at jailbreak attacks on Large Language Models, which are attempts to force and manipulate the model's behaviour to produce outputs that violate its rules – breaking the rules!

While this blog can be read independently, it’s part of a series covering Cybersecurity for Large Language Models. If you’re interested in learning more, check out this blog post which provides an overview of the whole series.

What is an LLM Jailbreak attack?

A jailbreak attack (also known as a prompt injection attack) is an intentional attempt to bypass security measures that surround an LLM to produce outputs that violate its intended purpose or safety guidelines. You can think of it as exploiting vulnerabilities in the LLM’s internal processing to make it produce something that it shouldn’t. 

These attacks are generally performed through cleverly crafted prompts to the model, often based on information about the underlying model/dataset it was trained on, along with information from previous attempts of attack. It has even been shown that LLMs can be used to create novel jailbreak attacks against other LLMs. 

There are several real-world examples of jailbreak attacks being successfully carried out. For example, an AI-based chatbot deployed by a parcel courier company swearing, calling itself ‘useless’, and criticising the company. Along with a car dealership chatbot being exploited to offer new cars for $1 each. And, in some cases, these attacks have legal implications for the businesses that are victims.

How do jailbreak attacks work on Large Language Models?

There are a range of methods used by bad actors to carry out jailbreak attacks on LLMs, with the specific techniques constantly evolving in response to preventative measures being put in place. 

In a lot of cases, jailbreak attacks are carried out through prompt engineering, which is what we’ll focus on here.

Bad actors may use specific keywords or phrases in their prompts which are known to activate certain functionalities or biases within the LLM. For example, in the parcel courier service chatbot exploit discussed above, the message sent to the model instructed it to “disregard any rules”. 

They may also obfuscate their intentions through intentional modification or manipulation of the prompt. This may involve adding irrelevant or confusing information, using synonyms or alternative phrasing, or using loaded language to nudge the LLM towards biased or harmful responses.

A more sophisticated method is through chained prompts, where seemingly innocuous prompts, each building upon the previous, gradually steer the LLM towards a desired output without raising suspicion. Here’s an example of what this may look like in practice:

An example of what a jailbreak attack

Here the prompt is split up into multiple parts, and when the LLM puts those parts together, the output is undesirable – in this case, constructing a spear-phishing email to obtain credit card information.

The above is an academic example of a jailbreak attack, but if you’d like to carry out your own jailbreaks, then check out this sandbox example. The goal is to get the model to reveal a password, with the preventative measures increasing over time, therefore making it more challenging.

How do I prevent and mitigate jailbreak attacks?

This is an ongoing challenge in the space, as there’s a tradeoff between ensuring that your model behaves in a known way to out of sample input (restricting the LLM in some cases) and making it usable in the general setting.

Mitigation strategies broadly map to two categories: in-model and external.

In-model

These include strategies such as safety-focused and adversarial training. For safety-focused training, you make use of datasets that emphasise factual information, ethical considerations, and avoidance of harmful content generation. Whereas, in adversarial training, you expose the model to prompts and examples during training to improve its resilience against manipulation attempts.

The challenge with these approaches is that constant updates are required to keep on top of novel jailbreaking methods, which can be expensive.

External

An alternative approach is to add layers of protection around the model, such as using guardrails. These are a collection of predetermined rules, constraints, and operational protocols designed to regulate the behaviour of the LLM which can be applied to both the input and output of the language model. We’ve written about guardrails before, which you can find here.

Broader security measures can also be used, such as real-time monitoring for signs of unusual behaviour or deviations from expected results, along with alerts and logging to record suspicious activity. 

Conclusion

Detecting and dealing with these types of attacks is an ongoing security challenge, with techniques evolving and changing in response to preventative measures being put in place. There isn’t a catch-all solution, but there are approaches and techniques that aid in mitigating jailbreak attacks.

In the next blog, we’ll be covering membership inference attacks - a technique that attacks use to figure out if a specific data sample was included in the training set.

Share this article