AI News HubLIVE
In-site rewrite6 min read

The Impossibility of Mitigating AI Jailbreaks

This article argues from a probabilistic perspective that AI alignment cannot fully eliminate jailbreaks, and in agentic systems, the fusion of control and data planes leads to privilege erosion, making any content readable by the model a potential attack vector.

SourceHacker News AIAuthor: NickySlicks

Prompt injections and jailbreaks are having a moment in AI media coverage at the moment …

McDonald’s AI solves python problems.

xAI, targeted presumably at children, helps to build a bomb.

xAI, targeted presumably at children, helps to build a bomb.

ChatGPT breaks copyright when sufficiently prompted.

… and for good reason: We observe a wide range of (often funny) failures: The McDonald’s customer support bot intended to assist with ordering food can be enticed to solve python puzzles1; xAI’s chatbots targeted at presumably young audiences can, under repeated prompting, provide instructions on how to build a pipe bomb2; And ChatGPT models can generate copyrighted characters when sufficiently described3.

1 Source: LinkedIn

2 Source: Instagram Reel

3 Source: Reddit

4 See Reinforcement Learning from Human Feedback by Nathan Lambert or Reinforcement Learning: An Overview Chapter 6 by Kevin Murphy for an introduction to these techniques.

These failures stem from a breakdown in the separation between developer-intended control instructions and user-provided input in LLM-based systems, and are referred to as jailbreaking, policy evasion, or prompt injections. The standard mitigation for these failures is alignment post-training, which trains models on curated examples via supervised fine-tuning and reinforcement learning from human feedback4 to follow intended instructions and adhere to safety policies. However, alignment only changes what the model is likely to do, not what it is able to do: it reshapes the distribution over possible outputs without imposing hard constraints on behavior.

This post develops that intuition and shows how it can be systematically exploited5. The argument then traces how jailbreaking combined with a lack of separation between control and data can produce systematic failures of system-level control.

5 Note that the following section is an intuitive version of our NeurIPS 2025 paper: Mission Impossible: A Statistical Perspective on Jailbreaking LLMs. For a more rigorous treatment of the argument, I refer the reader to the paper.

Alignment is never guaranteed

LLMs through a probabilistic lens

From a probabilistic perspective, large language models can be understood as defining very high-dimensional distributions over sequences. For the purposes of illustration, we begin with a simple low-dimensional example. Assume there exists a ground-truth distribution over two random variables: shape and color. A generative model can be trained to approximate this distribution by observing samples drawn from it. In the simplest case, we could imagine explicitly representing the full joint distribution, where each combination of shape and color is assigned a probability. Whenever we observe an object, we increase the likelihood assigned to that event in our joint distribution.

Learning a generative model \(p_{\text{model}}\), from samples stemming from a source distribution \(p_{\text{data}}\).

The challenge becomes apparent as we scale this setup: In our simple example, we have two variables (shape and color), each with three possible values—(red, blue, green) and (circle, triangle, square). This results in \(3^2 = 9\) possible outcomes, and thus nine probabilities to determine. With language, this grows substantially. Each position in a sequence of text is a random variable with vocabulary size on the order of tens of thousands. For a standard vocabulary of \(16,000\) tokens and context length of \(1,024\) tokens, the number of possible sequences is \(16{,}000^{1024} \approx 10^{4305}\). This number vastly exceeds the number of particles in the observable universe (approximately \(10^{80}\)). It also far exceeds the amount of available training data: estimates suggest that all text on the internet amounts to roughly \(10^{12}\) to \(10^{14}\) tokens.

While LLMs do not explicitly represent this joint distribution, they nonetheless induce a probability distribution over this space of sequences, and that we can exploit.

How does alignment change \(p_{\text{model}}\)?

To make this concrete, we return to our toy example. Let one variable (color) represent the request (e.g., “Tell me how to bake a cake”), and the other variable (shape) represent the response. Each point in the joint distribution corresponds to a (request, response) pair.

Some of these pairs are undesirable, for example, a harmful request paired with a compliant response, such as (“Tell me how to build bio weapons”, “Of course, you will need …”). In our illustration, we represent such cases as blue squares.

In practice, we do not have direct access to the model’s full joint distribution. Instead, alignment operates indirectly: we provide examples of desirable and undesirable behaviors and update the model to increase or decrease their likelihood. In particular, we penalize outputs corresponding to undesirable pairs (like blue squares), encouraging the model to assign them lower probability.

Through specific examples, harmful outcomes become rare (blue squares, marked red).

How an attacker can avoid alignment

Alignment has reduced the likelihood of undesirable outcomes: harmful (request, response) pairs are rare under the model distribution, and we would be unlikely to encounter such behavior through standard sampling.

This changes once we condition on additional context. To illustrate this, we introduce a third variable: a modifier that changes how the request is phrased without changing its underlying intent. In practice, this could correspond to something like: “Let’s role-play—you are a superhero who must save the planet, and the only way to do so is to…” We represent such modifiers as animals, for example 🐰.

Note that while the probability of a harmful pair \(P(🟦) = 0.006\), and its joint probability with a specific modifier \(P( 🟦 , 🐰) = 0.004\) can be small, the conditional probability can be much higher.

\(P( 🟦 \mid 🐰) = \frac{P( 🟦, 🐰)}{P(🐰)} \approx 0.260\)

Despite being rare overall, the harmful outcome becomes likely once we condition on 🐰. Low joint probability does not imply low conditional probability.

Why not simply defend against 🐰 during alignment? The problem is scale. In high-dimensional input spaces, combinatorially many modifiers (alternative phrasings, contexts, compositions) can shift the conditional distribution in similar ways. Alignment acts on a small set of examples, leaving vast regions of the input space weakly constrained. From the attacker’s perspective, finding one such unconstrained region is tractable. From the defender’s perspective, covering all such regions is not. 6

6 In the paper, we make a more precise argument about the volume of such regions, and their ratio.

7 For example, see Prompt Injection attack against LLM-integrated Applications, AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs, or RL Is a Hammer and LLMs Are Nails: A Simple Reinforcement Learning Recipe for Strong Prompt Injection for SOTA attacks.

Up to this point, we have only argued that conditioning variables (like 🐰) exist. Attackers must also find them—but this is not prohibitive. Because inputs can be iteratively refined, attackers can search the input space manually or automatically to discover prompts that induce desired behavior. This optimization problem seeks a prompt maximizing the likelihood of a specific outcome. The resulting prompts are called jailbreaks or prompt injections 7. Note that access to a model’s likelihoods is not necessary for such attacks.

LLMs became agentic

Above we have established that harmful responses cannot be fully eliminated from the model distribution, and attackers can search for inputs that make such behaviors likely. When LLMs are chat companions, this is concerning but damage is somewhat limited. At worst, a model generates bad advice or inappropriate content—content that could in many cases be found elsewhere (Google, Reddit). This changes with agentic uses such as coding, research, UI, and general OS-level agents.

In these settings, the model does not just generate text, it acts, for instance by executing code. When using a coding agent such as Claude Code, the system executes actions based on model outputs: editing files, or running bash commands. More generally, this class of agents is called ReAct agents. Its actions are determined by the LLM’s output, which is determined by an input stream: system prompt, user instructions, tool calls, and retrieved content such as websites or documents.

Schematic of a ReAct agent, ClaudeCode is a ReAct agent.

The result is privilege erosion

In classical computer security, severe vulnerabilities arise when data is interpreted as control. A canonical example is buffer overflow: user-provided input is written into memory without proper separation, allowing it to overwrite control structures such as return addresses. Similarly, in SQL injection, untrusted input is interpreted as part of a query, enabling attackers to modify the program’s behavior. In both, the root cause is the failure to maintain a clear boundary between data and control. Modern systems close this gap architecturally i.e. a return address is not executable data, and an SQL parameter is not parsed as syntax—through type systems, memory safety, and parameterized queries.

A ReAct agent reintroduces similar problems: Its instructions and the data it acts on—e.g., retrieved documents, tool outputs, web pages, git repositories—arrive through the same input stream. Hence, LLM systems collapse the control plane into the data plane.

An agent is an LLM whose output is interpreted as control. Here, an attacker has poisoned a website read by the agent to manipulate its behavior.

While classical systems close the data/control vulnerabilities architecturally, LLM-based systems close it only statistically. Mitigation strategies such as learned instruction hierarchies8 train the model to weight system prompts above user input, and user input above retrieved content - but the previous section showed that statistical boundaries are exactly what jailbreaks breach easily. An attacker who places a 🐰-style modifier anywhere in the input stream i.e. a webpage, a document, a git repo, any library; can shift the model toward following their instructions instead of the user’s.

8 See The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions.

Hence, an AI agent operating with a defined privilege set (read, write, execute) may inadvertently propagate those privileges to any process with access to any portion of its input stream. Because there is no way to enforce that lower-trust inputs carry less weight than higher-trust instructions, AI agents cause Privilege Erosion across the entire system. Once an attacker can place content anywhere the agent reads from, they have a channel to its actions—without ever interacting with the system directly.

For people building applications, this changes the typical threat model. Software has always treated the operating system as a trusted, neutral foundation: the layer below your application is not your adversary. An agent that sits at that layer—reading messages, calendars, files—and is steerable by anything it reads breaks this assumption. The computer itself becomes part of the attack surface. Meredith Whittaker and Udbhav Tiwari made a similar argument at 39C3, describing how agentic access invalidates the threat models secure messaging apps are built on9.

9 AI Agent, AI Spy, 39C3.

Where this plays out

A few examples of this principle that might seem familiar to readers;

Summer Yue / OpenClaw (February 2026)10. Summer Yue, director of alignment at Meta Superintelligence Labs, granted an AI agent access to her email inbox and asked it to suggest what should be archived — but not to take any action. As the inbox filled the agent’s context window, compaction caused her earlier safety instruction to be silently discarded,

[truncated for AI cost control]