Red-Teaming after Mythos — Zico Kolter & Matt Fredrikson, Gray Swan
OpenAI boardmember Zico Kolter and Gray Swan CEO Matt Fredrikson join swyx to explain why AI security is not just “cybersecurity with AI,” why agents introduce a new class of vulnerabilities, and why the next major AI incident may be a gray swan: unlikely, but clearly visible before it happens. They discuss prompt injection, automated red teaming, model robustness, agent identity, and the emerging AI insurance/compliance stack.
AI Engineer World’s Fair regular bird tix will sell out ~today! Join us next week ahead of the Late Bird price hike and get >$40,000 in sponsor credits for attending!
Thanks to the US Government issuing an export control directive on Mythos and Fable, the risks of jailbreaks and (industry term) indirect prompt injection are suddenly the talk of the town, though we have been covering AI security for a few years now, from Hackaprompt to the enigmatic Pliny the Elder.
Zico Kolter, member of OpenAI’s board of directors on the Safety & Security Committee, and Matt Fredrikson, CMU professor and CEO of Gray Swan, co-authored the definitive paper on Indirect Prompt Injections, and Gray Swan were cited authorities on the Mythos model card, directly investigating the exact capabilities that are under scrutiny right now:
We seized the opportunity to ask them the state of AI Red Teaming, and Shade, the adversarial red teaming tool that Anthropic used to evaluate the robustness of their models against prompt injection attacks in coding environments. Shade is part of their overall toolkit covering Simon Willison’s Lethal Trifecta, including Cygnal, an AI guardrails product, and the world’s largest AI Red Teaming Arena, including AIRT celebrity Wyatt Walls.
All of this security tooling, and yet, we’re only staving off the inevitable.
The risks of extremely smart AI increasingly feel like gray swan events: an event that everyone can see coming.
In this episode, Gray Swan cofounders Zico Kolter and Matt Fredrikson join swyx to explain why AI security is not just “cybersecurity with AI,” why agents introduce a new class of vulnerabilities, and why the next major AI incident may be a gray swan: unlikely, but clearly visible before it happens.
We go deep on prompt injection, automated red teaming, model robustness, agent identity, computer-use agents, enterprise guardrails, and the emerging AI insurance/compliance stack. Zico and Matt also explain why frontier models are not automatically safer as they scale, why specialized red-teaming models can now beat humans at breaking AI systems, and why the future of AI security may depend on AI systems attacking, defending, and interpreting other AI systems.
We discuss:
Why AI systems need a different security mindset from traditional software
How prompt injection creates a new exploit class for agents like Codex and Claude Code
Gray Swan Arena and the rise of community red teaming
Shade: AI that can outperform humans at breaking models
Why LLMs are an alien form of intelligence that fail differently from humans
Human vs browser-agent robustness and why humans ranked fourth
Why eval awareness and capability elicitation matter
Cygnal: Gray Swan’s guardrail model for policy enforcement
Why bigger models do not automatically become more robust
The lethal trifecta: untrusted data, private data, and exfiltration
Why “just prompt it better” is not enough for enterprise AI security
OpenClaw, computer-use agents, and the agent security nightmare
Agent-native identity, permissions, and enterprise deployment
Why AI security may become part of insurance and compliance
Why the first major AI prompt-injection breach may be inevitable
Gray Swan
Website: https://www.grayswan.ai/
Zico Kolter
X: https://x.com/zicokolter
Website: https://zicokolter.com/
LinkedIn: https://www.linkedin.com/in/zico-kolter-560382a4/
Matt Fredrikson
Website: https://www.mattfredrikson.com/
LinkedIn: https://www.linkedin.com/in/matt-fredrikson-7596349/
Timestamps
00:00:00 Introduction
00:02:31 Why AI Security Is Different
00:06:38 Testing Claude, Codex, and Prompt Injection
00:07:47 Gray Swan Arena and Automated Red Teaming
00:11:14 AI That Breaks Models Better Than Humans
00:14:00 LLMs as Alien Intelligence
00:19:00 Humans vs AI Agents
00:24:35 Red Teaming, Jailbreaks, and Capability Elicitation
00:26:11 Cygnal: Guardrails for AI Agents
00:34:04 The Lethal Trifecta
00:39:31 Can AI Automate AI Research?
00:45:47 OpenClaw and the Computer-Use Security Problem
00:50:44 Agent Identity, Permissions, and Enterprise AI
00:54:24 The Future of AI Security
01:00:30 AI Insurance and Compliance
01:04:32 The Gray Swan Event Everyone Sees Coming
01:06:04 Closing Thoughts
Transcript
Introduction: Gray Swan, AI Security, and CMU
Swyx [00:00:00]: We’re here in the studio with Gray Swan, Matt and Zico. Welcome.
Zico [00:00:08]: Great to be here.
Matt [00:00:09]: Thanks for having us.
Swyx [00:00:10]: You’re visiting from Pittsburgh? The home of all good computer science. I don’t know if I’m overstating things. A very strong university.
Zico [00:00:18]: CMU has been the center of a lot of AI since really the dawn of the field.
Swyx [00:00:22]: Especially a lot of self-driving and some language learning. Congrats on your Series A. You’re here because you’re attending Snowflake Summit, and Snowflake is one of your investors. Let’s introduce crisply at the top: what is Gray Swan, and what have you chosen as your startup domain?
Matt [00:00:42]: At Gray Swan, our mission is to empower everyone to use AI safely and securely. Large language models are software, and if you want to deploy them or build applications on top of them, you need to understand the vulnerabilities and what can go wrong. That includes everyday mistakes, like an agent making the wrong tool call, but also worst-case scenarios where an attacker has an incentive to make your agent misbehave, leak data, or steal credentials. Gray Swan grew out of our research at Carnegie Mellon, where Zico and I have spent over a decade studying new vulnerabilities and attack surfaces in deep learning systems: how to test for them, understand their severity, and make inference more robust.
Adversarial Examples and Why AI Security Is Different
Swyx [00:02:05]: Honestly, a very fruitful area of study for any academic. Throwback, this is 10 years ago, which is basically the entirety of me. I got a lot of inspiration from Ian Goodfellow, a friend of the pod, and this is one of those initial adversarial settings.
Matt [00:02:23]: This paper was directly inspired by Ian’s work.
Swyx [00:02:29]: Zico, what about your side of the story?
Zico [00:02:31]: Like Matt, I have been faculty at Carnegie Mellon for a while. Fundamentally, we believe in the transformative power of AI. It has already transformed the software ecosystem, and it will transform many other ecosystems going forward. The issue is that these systems behave very differently from the software we are used to. I do not just mean that AI can find vulnerabilities in software, though it can. I mean that AI systems have inherent vulnerabilities of their own. They can be tricked in ways people can be tricked, so you need a different security mindset.
Zico [00:03:23]: This matters especially when there is the possibility of correlated failures. It is not just that there are many AI systems out there; it is that everyone is using a few models. If you find vulnerabilities in agents that everyone uses, like Codex and Claude Code, you have a new class of exploit. The labs are doing a lot of work here, but when a new platform emerges, a separate security system often emerges alongside it. That is where we are with AI: there is a need for specifically minded AI safety and security providers, and the demand is only going to grow.
Treating Models as Untrusted Systems
Swyx [00:04:55]: I want to highlight right at the top that this is not a cyber episode in the traditional sense. A lot of people looking at the title might think that, but you’re actually trying to treat these models inherently as untrusted entities?
Zico [00:05:11]: Exactly. This is a common conflation because AI is also good at cybersecurity problems, both solving them and causing them. But AI systems themselves introduce new vulnerabilities. Gray Swan is not about using AI to make your cyber infrastructure better; it is about understanding and mitigating the security risks you bring in when you adopt and deploy AI.
Matt [00:05:49]: A big part of that is how people are using artificial intelligence. Once you build entire autonomous systems on top of models and integrate them into your larger platform or network, you have a potential cybersecurity risk. The goal is to mitigate the risk posed by the AI as it relates to your broader cybersecurity goals.
Testing Claude, Codex, and Indirect Prompt Injection
Zico [00:06:17]: Part of this is red teaming. One reason we reached out to you was that you were involved in the Claude Mythos preview, where you were one of the authorities on IPI, or indirect prompt injection. When you receive a model, it does not have to be Mythos, but that is the most prominent one right now: what do you do with it?
Matt [00:06:38]: We do a range of things. In the Mythos case, the concern from Anthropic was how robust the model is to indirect prompt injection. If you operate a coding agent and use Mythos as the model, it will fetch untrusted content and read text you do not control. How robust will it be at staying true to its original objective and not getting hijacked? We also help frontier labs test their safeguards for issues like cyber misuse. Broadly, we provide adversarial safety and security evaluations so model builders can assess progress from one iteration to the next.
Zico [00:07:37]: They also do this in-house, and Anthropic is very ideologically inclined to do it. What do they choose to outsource versus keep in-house?
Gray Swan Arena and Automated Red Teaming
Matt [00:07:47]: So there are two things that I think, we stand out for. One is the Gray Swan Arena. So we operate a community of red teamers. We provide, prize challenges. a lot of these come from the needs of the lab sponsors. so to an extent gamify red teaming objectives, put up a prize pool, and pay people when they find ways to circumvent and violate whatever the safety and security objectives of the model developers were. So that’s, that’s one. It’s, it’s a really great community, like 15,000 people come and hang out on the Discord server. Not all of them take part in every competition, but a lot of a lot of good data and good signal is provided to the upstream model developers through that community. The second is the automated red teaming that we do. So we train, a family of models to be very effective and rigorous at doing automated red teaming, both of the base model, right? So just thinking of it, as a turn-based, chatbot without tools or anything, and agents built on top of it. And it hasn’t been saturated yet, so when the frontier labs come to us, we’re still able to find ways to indirect prompt injection or jailbreak or just generally get their models to do things that they wouldn’t want to.
Zico [00:09:11]: Did you say without tools?
Matt [00:09:12]: With and without tools.
Zico [00:09:13]: With and without tools.
Matt [00:09:13]: So we definitely operate on On agents as well.
Zico [00:09:16]: Obviously that would be more useful.
Matt [00:09:17]: Yep. that’s, that’s actually a fairly recent thing. For a while, what we would help, the frontier labs with was more just, chat-based interactions, going around their content safety policies and what is in their model spec. Now the focus is very much on agents and tool use and all the downstream applications that people want to build on top.
Shade: Automated Red Teaming Models
Zico [00:09:39]: This is a inspired topic. I wonder if there’s any such thing as, on policy red teaming where our models from the same family, same data set, more capable of red teaming themselves.
Matt [00:09:51]: That’s an interesting question. We unfortunately we do have the ability to test that out on smaller open-source models.
Zico [00:09:58]: So generally speaking, the issue with this is that frontier models are extremely bad at automated red teaming Because they have a lot of safeguards built into them. So if you tr
[truncated for AI cost control]