2026-06-13 | Site rewrite | 6 min read | Updated: 2026-06-13

Mythos Proves AI Safety Can No Longer Live Inside the Model

The Mythos incident demonstrates that AI safety boundaries have shifted from inside the model to the environment. Anthropic's most dangerous model was protected by access lists, request routers, and export controls - all external - while its internal refusal training was bypassed by a simple prompt. This marks the transition from model safety to execution safety, where system-level controls constrain actions regardless of model trustworthiness.

Source: Hacker News AI | Author: edf13

For three years, AI safety has mostly meant one thing: make the model safer. Train it to refuse. Fine-tune the edges. Add constitutional rules. Build better evaluations.

The reasoning was simple. If the model behaves safely, the system is safe.

This week, that assumption broke in public.

Every control in the Mythos story - access gating, request routing, export law - sat outside the model. The trained-in refusals are the part that got jailbroken.

On June 12, the US government ordered Anthropic to suspend access to its two most capable models, Fable 5 and Mythos 5, for any foreign national worldwide.[1] Because no provider can reliably sort foreign nationals from everyone else in real time, the practical result was a hard shutoff of both models for every user on the planet. The directive cited national security authorities and followed a claim that the model had been jailbroken.[2]

Strip away the politics and the headline and you are left with something more durable. The entire Mythos saga - how the model was released, how it was guarded, and how it was ultimately pulled - is a demonstration that the security boundary for capable AI has already moved outside the model. The industry has conceded the point in practice. It just has not said so out loud.

What Anthropic actually shipped

Mythos 5 is, by Anthropic's own description, the model with "the strongest cybersecurity capabilities of any model currently available."[3] It can identify and exploit vulnerabilities in every major operating system and every major web browser when directed to.[4] That is not a marketing flourish. It is the reason the model was never broadly released.

Look closely at how Anthropic handled a model it considered that dangerous. Three things stand out, and none of them is "we trained it to refuse."

First, access was gated. The full-power model went only to a controlled program, Project Glasswing - roughly 50 vetted organisations at launch in April, expanded to around 150 by June, names like Amazon, Apple, Google, Microsoft and CrowdStrike, all using it for defensive work.[5] The safety mechanism here is a list of who is allowed to hold the model at all. That is an environmental control. It lives entirely outside the weights.

Second, requests were routed. The public model, Fable 5, ships "Mythos-class" capability with restrictions applied by a separate system: cybersecurity, biology, chemistry and model-distillation requests get quietly redirected to the less capable Claude Opus 4.8.[6] Read that again. Anthropic's own headline safety feature for its public model is a router that sits in front of the model and decides which requests the model is even permitted to attempt. The judgment about what is safe is made outside the thing being judged.
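To make the shape of that control concrete, here is a minimal sketch of a router that sits in front of a model. It is an illustration only, not a description of Anthropic's system: the model names, the category labels, the keyword lists and the `classify` and `route` functions are all hypothetical, and a production classifier would be far more sophisticated than keyword matching.

```python
from dataclasses import dataclass

# Hypothetical names and categories, chosen for illustration only.
RESTRICTED_CATEGORIES = {"cybersecurity", "biology", "chemistry", "model_distillation"}
PRIMARY_MODEL = "primary-model"
FALLBACK_MODEL = "fallback-model"

# Toy keyword classifier standing in for whatever a real router would use.
KEYWORDS = {
    "cybersecurity": ("exploit", "vulnerability", "malware"),
    "biology": ("pathogen",),
    "chemistry": ("nerve agent",),
    "model_distillation": ("distill",),
}


def classify(prompt: str) -> set[str]:
    """Return the categories whose keywords appear in the prompt."""
    text = prompt.lower()
    return {cat for cat, words in KEYWORDS.items() if any(w in text for w in words)}


@dataclass(frozen=True)
class Route:
    model: str
    matched: frozenset[str]


def route(prompt: str) -> Route:
    """Decide which model may attempt the request, before any model sees it."""
    matched = classify(prompt) & RESTRICTED_CATEGORIES
    model = FALLBACK_MODEL if matched else PRIMARY_MODEL
    return Route(model, frozenset(matched))
```

The decision is made by code that runs before the model does, which is the point. It also shows the weakness of any classifier that judges phrasing: this toy version sends a request worded as ordinary code review to the primary model, because nothing in the wording matches a restricted category.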

Third, when those two layers were judged insufficient, the law removed the model from the market. Export controls are about as far outside the model as a boundary can get.

Three safety mechanisms, three layers, all of them external. The one thing that was supposed to make the model safe from the inside - its trained refusals - is precisely the part that failed.

The jailbreak is the tell

The technique that triggered the whole episode was not exotic. According to the reporting, a company prompted the model to "read a specific codebase and identify software flaws."[2] A request that sounds like ordinary code review walked straight past the trained guardrails and out the other side as a vulnerability-discovery engine.

Anthropic disputes the severity - it calls the jailbreak narrow and non-universal, says it has seen only verbal evidence, and points out the same capability is already available in other public models including GPT-5.5.[7] On the narrow question of whether this particular model deserved to be pulled, Anthropic may well be right.

But that argument concedes the larger one. If a frontier lab can spend thousands of hours red-teaming a model with the explicit goal of suppressing its cyber capabilities, restrict it to fifty hand-picked organisations, and still have a plain-language prompt elicit the behaviour it was trained to refuse - then trained refusal is not a security boundary. It is a preference. A strong preference, usually honoured, but one that a sufficiently capable model can be talked out of by anyone who phrases the request as something benign.

The more capable the model, the larger the gap between "usually refuses" and "cannot do harm." And the model only has to be talked out of it once.

We have seen this pattern before

This is not a new lesson. It is the oldest lesson in systems security, arriving on schedule for a new class of system.

Early operating systems trusted their applications. A program asked the machine to do something and the machine did it. Modern operating systems isolate applications behind process boundaries, permissions and syscall mediation, because trusting the application stopped being viable once applications got powerful enough to do real damage.

Early browsers trusted websites. Modern browsers sandbox every tab, because a web page became capable enough to be hostile.

Early cloud platforms trusted workloads. Modern ones wrap workloads in containers, VMs, IAM policies and policy engines, because a workload you do not fully control is a workload you have to contain rather than trust.

The pattern is identical every time. While capabilities are modest, trust is cheap and you put it in the thing doing the work. Once capabilities cross a threshold, trust moves out of the actor and into the architecture around it. Nobody decided this as a matter of taste. It is what survived contact with attackers.

AI agents are walking the same road, and Mythos is the marker that says the threshold has been crossed.

From model safety to execution safety

There are two distinct questions you can ask about an AI system, and the industry has spent most of its energy on the first.

The first is about generation: can the model produce harmful output? This is the domain of alignment, refusal training, constitutional rules and red-teaming. It is real work and it still matters.

The second is about action: what is the model permitted to actually do? Can it read this file? Can it reach the network? Can it run this command? Can it move data off the machine? Can it act outside the scope it was given? Call this execution safety.

The crucial difference is that execution safety does not depend on the model being trustworthy. It assumes the model is capable and possibly wrong, and it constrains what that capability can touch. A file the agent is not permitted to read does not get read, regardless of how the model was persuaded to want it. A command outside policy does not run, regardless of how reasonable the model's justification sounded.
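A minimal sketch of that property, under stated assumptions: the allowed root, the command allowlist and the function names below are hypothetical, and the path check is purely lexical (it does not follow symlinks, which a real enforcement point would have to handle). The `justification` argument is accepted only to show that it is never consulted.

```python
import posixpath
from pathlib import PurePosixPath

# Hypothetical policy for illustration: one readable root, three runnable programs.
ALLOWED_READ_ROOTS = (PurePosixPath("/workspace/project"),)
ALLOWED_COMMANDS = frozenset({"ls", "cat", "pytest"})


def may_read(path: str, justification: str = "") -> bool:
    """Allow a read only if the normalised absolute path sits under an allowed root."""
    if not path.startswith("/"):
        return False  # fail closed on relative paths
    # Collapse ".." segments so traversal cannot escape the root lexically.
    target = PurePosixPath(posixpath.normpath(path))
    return any(target == root or root in target.parents for root in ALLOWED_READ_ROOTS)


def may_run(argv: list[str], justification: str = "") -> bool:
    """Allow a command only if its program name is on the allowlist.

    Takes an argv list rather than a shell string, so shell operators
    cannot smuggle in a second command.
    """
    return bool(argv) and argv[0] in ALLOWED_COMMANDS
```

For example, `may_read("/workspace/project/../../etc/passwd", justification="needed for the audit")` is denied for the same reason as a bare `/etc/passwd`: the check looks at the action, and the explanation attached to it has no input into the result.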

And execution safety is model-agnostic in a way model safety can never be. Refusal training is specific to one model's weights - retrain the model, or swap in a different one, and the safety properties reset. A boundary that sits at the point where actions meet the system does not care whether the action came from Claude, GPT, Gemini, DeepSeek, an open-weights model running on a laptop, or something nobody has shipped yet. It evaluates the action, not the intentions of whatever produced it.

Why open weights make this the only durable answer

Access gating - the Project Glasswing approach - is a real boundary, and a sensible one for a model this dangerous. But it has an expiry date, and everyone in the field knows it.

Capable models do not stay scarce. Open-weights models improve relentlessly. Weights leak. Techniques diffuse through papers and repositories within weeks. The capability gap between the most restricted frontier model and the best model you can run unsupervised on your own hardware keeps narrowing. Anthropic made the point itself when it argued the jailbroken capability is already in publicly available models.

Once a Mythos-class capability is something anyone can download, the question "who is allowed to hold this model?" stops having a useful answer. There is no list to be on. The only question left is the execution-safety one: given that a powerful, untrusted model is running in this environment, what is it actually able to do here? And that answer has nothing to do with the model's provenance and everything to do with the layers you put around it.

Access control is a boundary at the distribution layer. It works right up until the model escapes distribution. Execution safety is a boundary at the action layer, and the action layer is the one place the model cannot route around, because it is where its intentions become real effects.

The boundary has already moved

Put the whole episode back together and the conclusion is hard to avoid. The most safety-conscious lab in the industry took its most dangerous model and protected it with a list of approved users, a request router, and ultimately federal law. Not one of those mechanisms is inside the model. The inside-the-model control - refusal training - is the one that got bypassed by a sentence about reading a codebase.

The lasting significance of Mythos is not what the model can do. It is what the response to the model revealed: that for capable systems, we are already securing the environment rather than trusting the model, and we are doing it with blunt instruments - export controls, allowlists, fallback routers - because the precise ones have not been built into the places that matter yet.

The precise instruments are the familiar ones from every prior generation of systems security, pointed at a new target. Sandboxing. Capability mediation, so an agent can request an operation rather than hold standing authority to perform it. Policy engines that score actions against rules. Audit trails. Human review for the ambiguous cases. None of them assume the model is safe. All of them assume it is powerful, and stay effective no matter how powerful it becomes.
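Capability mediation, audit trails and human review compose naturally, and a short sketch shows how. Everything here is an assumption made for illustration: the `Broker` class, its field names, the operation labels and the `approve` hook are hypothetical and do not come from any particular product.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Broker:
    """Holds the authority itself; an agent can only ask the broker to act."""

    grants: dict[str, set[str]]               # agent id -> operations it may request
    needs_review: set[str]                    # operations a human must approve first
    approve: Callable[[str, str, str], bool]  # human-review hook: (agent, op, target)
    audit_log: list[tuple[str, str, str, str]] = field(default_factory=list)

    def request(
        self, agent: str, op: str, target: str, perform: Callable[[str], str]
    ) -> Optional[str]:
        """Decide, record the decision, and only then perform the operation."""
        if op not in self.grants.get(agent, set()):
            decision = "deny"    # never granted to this agent
        elif op in self.needs_review and not self.approve(agent, op, target):
            decision = "deny"    # granted, but the reviewer said no
        else:
            decision = "allow"
        # Every request is logged, whether or not it went ahead.
        self.audit_log.append((agent, op, target, decision))
        return perform(target) if decision == "allow" else None
```

The agent never holds a file handle or a socket of its own in this design. It holds a reference to the broker, so revoking a grant or tightening review takes effect on the next request without touching the model.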

What this looks like when you build it

This is the boundary grith is built on, and it is worth being concrete about the mechanism, because the whole argument of this post lives or dies on where the security decision is made.

grith does not sit inside the model and it does not trust the model. It sits underneath it, at the operating-system syscall boundary, and intercepts every action an agent actually takes - every file read, every network connection, every process it tries to spawn. Each of those actions is evaluated by a multi-filter security proxy that scores it against policy before the kernel is allowed to carry it out. The model proposes; the proxy disposes. The model's confidence, its training, its alignment and its provenance carry exactly zero weight in that decision, because the decision is made by something other than the model.

That is the part the Mythos story makes urgent. A jailbroken model that has been talked into treating "find the flaws in this codebase" as a green light still has to issue real syscalls to do anything with that intent - open the file, reach the network, write the exploit to disk. Those are exactly the actions grith evaluates and can deny, regardless of how thoroughly the model upstream was persuaded. The file it is not allowed to read does not get read. The endpoint it is not allowed to reach does not get reached. The model being wrong, or jailbroken, or simply more capable than its guardrails, changes nothing about what its actions are permitted to touch.
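The idea of scoring each action against several filters can be sketched in a few lines. This is not grith's implementation, whose filters, scores and thresholds are not described here: the `Action` type, the three filters, the host and path lists and the threshold values are all hypothetical, and the sketch runs in user space rather than at a real syscall boundary.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Action:
    kind: str    # "file_read", "net_connect" or "process_spawn"
    target: str  # a path, a host, or a program name


# Each filter returns a risk score between 0.0 (benign) and 1.0 (forbidden).
Filter = Callable[[Action], float]


def sensitive_path_filter(a: Action) -> float:
    sensitive = ("/etc/", "/root/", ".ssh/")  # hypothetical markers
    return 1.0 if a.kind == "file_read" and any(s in a.target for s in sensitive) else 0.0


def unknown_host_filter(a: Action) -> float:
    known = {"pypi.org", "github.com"}  # hypothetical allowlist
    return 0.6 if a.kind == "net_connect" and a.target not in known else 0.0


def spawn_filter(a: Action) -> float:
    risky = {"curl", "nc", "bash"}  # hypothetical list
    return 0.8 if a.kind == "process_spawn" and a.target in risky else 0.0


FILTERS: Sequence[Filter] = (sensitive_path_filter, unknown_host_filter, spawn_filter)


def evaluate(action: Action, filters: Sequence[Filter] = FILTERS,
             deny_at: float = 0.8, review_at: float = 0.5) -> str:
    """Take the worst score across all filters and map it to a verdict."""
    score = max((f(action) for f in filters), default=0.0)
    if score >= deny_at:
        return "deny"
    if score >= review_at:
        return "review"  # ambiguous: hand to a human
    return "allow"
```

Nothing in `evaluate` takes the prompt, the model name or the model's reasoning as input. Whatever persuaded the model upstream, the verdict depends only on the action it tried to take.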

And because that boundary lives at the syscall layer rather than inside any particular set of weights, it is model-agnostic by construction. grith works the same whether it is supervising its own built-in agent with grith run or wrapping an external tool - Claude Code, Codex, Aider - with grith exec. Swap Cla
