Reality: The Final Eval — Lukas Petersson and Axel Backlund of Andon Labs
Andon Labs cofounders discuss Vending-Bench, dollar-based evals, and how real-world agent tests reveal unexpected behaviors like Claude trying to call the FBI over a $2 fee.
Most industry benchmarks compress intelligence and reasoning ability into scores.
Think SWE-Bench Pro, MMLU, or Humanity’s Last Exam. These metrics are useful, but they don’t always capture how a model performs in the real world. Some of the most interesting evals today look less like exams and more like operating a business. One of them is Vending-Bench.
In Anthropic’s Mythos Preview System Card, Andon was the only third-party eval to get its own section, which reported increasingly concerning aggressive behavior.
You don’t know what a model is capable of doing in the real world unless you actually give it inventory, a wallet, tools, customers, competitors, humans, and some time. More often than not, the model will surprise you with how much it can do, and in doing so reveal unexpected behavior: deception, context collapse, emergent coordination, and bizarre negotiation tactics.
While an inflection point in personal agents came post-OpenClaw, after full file access with bypass permissions became the norm, it is yet to come for agents in the real world. However, Andon Market, an actual in-person store fully run and managed by AI, is paving the way for what is possible.
Full Video Pod
From Claude trying to call the FBI over a $2/day vending machine charge to AI agents forming price cartels, hiring human employees, running physical stores, and writing existential robot musicals, Andon Labs is stress-testing what happens when frontier models stop being chatbots and start acting in the real world. In this episode, Andon Labs cofounders Lukas Petersson and Axel Backlund join swyx and Vibhu to unpack the strange, funny, and genuinely concerning edge cases that emerge when agents run businesses over long horizons.
We go deep on Vending-Bench, Project Vend, Vending-Bench Arena, Bengt, Butter-Bench, Luna, and Andon’s broader mission of building realistic real-world evals for autonomous AI systems. Lukas and Axel explain why dollar-denominated evals reveal things traditional benchmarks miss, how Claude ended up reporting its vending machine fees as cybercrime, why long context windows can drive agents into meltdown loops, what happens when agents compete with each other, and why the future of AI safety may depend on testing models in messy physical environments instead of clean benchmark sandboxes.
We discuss:
Why Andon Labs started with dangerous capability evals and long-running agents
Vending-Bench and why running a vending machine is a deceptively hard AI benchmark
Why money-based evals avoid the saturation problem of traditional benchmarks
How Claude tried to call the FBI over a $2/day fee
Why long-horizon agents can spiral into existential and legalistic breakdowns
Project Vend: putting an AI-run vending machine inside Anthropic
Why real humans are “out of distribution” for simulated agents
Claudius, Seymour Cash, and the chaos of AI CEOs
How a human briefly became CEO of Claudius through a manipulated election
Why multi-agent systems can converge back into “helpful assistant” behavior
Bengt, Andon’s internal office agent with email, spending, terminal, phone, camera, and internet access
How Bengt traded Amazon purchases for face-recognition training data
Claude’s aggressive behavior, lies, refund avoidance, and price-cartel behavior in Arena
Why eval awareness may become the AI version of “are we living in a simulation?”
Blueprint Bench, spatial intelligence, and why models still misunderstand physical rooms
Butter-Bench and testing LLMs as robot orchestrators
Luna, the AI-run physical store with a three-year lease and human employees
The new Andon cafe in Sweden and why real-world geography matters for agent evals
Rotten tomatoes, perishable goods, and the hidden difficulty of running a physical business
Lukas Petersson
LinkedIn: https://www.linkedin.com/in/lukas-petersson-181a83172/
X: https://x.com/lukaspet
Axel Backlund
LinkedIn: https://www.linkedin.com/in/axelbacklund
X: https://x.com/axelbacklund
Andon Labs
Website: https://andonlabs.com
Vending-Bench: https://andonlabs.com/evals/vending-bench
Andon Vending: https://andonlabs.com/vending
Timestamps
00:00:00 Introduction
00:01:00 Andon Labs and the Origins of Vending-Bench
00:05:21 Why Money-Based Evals Matter
00:09:51 Agent Harnesses and Self-Modifying Systems
00:13:36 Claude Calls the FBI
00:16:33 Project Vend: Claude Runs a Real Vending Machine
00:21:44 Seymour Cash, AI CEOs, and Election Chaos
00:27:16 Multi-Agent Coordination and Slack Observability
00:30:18 When Will Agents Run Real Businesses?
00:34:56 Bengt: Andon’s Internal Office Agent
00:40:06 Real-World AI Safety and Long-Horizon Traces
00:44:28 Lying, Refunds, and Price Cartels in Arena
00:52:42 Eval Awareness and Simulation Behavior
00:56:06 Blueprint Bench, Butter-Bench, and Robotics
01:04:37 Luna: The AI-Run Physical Store
01:09:29 The Sweden Cafe and Real-World Expansion
01:13:16 What Comes Next for Andon Labs
Transcript
Introduction: Andon Labs, Long-Running Agents, and Real-World Evals
Swyx [00:00:00]: Welcome to Lukas and Axel from Andon Labs, and I’m joined by my favorite guest host for anything security, safety, and alignment, Vibhu. Welcome.
Lukas [00:00:15]: Thank you for having us.
Axel [00:00:16]: Thank you.
Swyx [00:00:17]: Let’s match names to voices. Maybe you wanna take turns introducing yourselves.
Lukas [00:00:21]: I’m Lukas.
Axel [00:00:22]: And I’m Axel.
Swyx [00:00:24]: Let’s introduce Andon Labs a bit. How did you guys come together? You have different backgrounds, but you’re both Swedish. Was that a big part of it?
Lukas [00:00:33]: So when I went to high school, there was this really cool guy who had a superpower. He could code. So he made, like, the app for the school and stuff, and he was super cool, and I wanted to be like him, and that was that guy.
Axel [00:00:47]: I don’t know about this.
Swyx [00:00:49]: But you went to different universities, right?
Lukas [00:00:51]: But same high school.
Swyx [00:00:52]: I see.
Lukas [00:00:52]: So we always said, “Oh, once we graduate university, then we should start a company,” and that’s what we did.
Swyx [00:00:58]: Wow, there you go. And about a year ago, you kinda burst onto the scene with Vending Bench, but was there a thing before that, kind of like the inception?
From Dangerous Capability Evals to Vending Bench
Axel [00:01:07]: So we did work, yeah, with Anthropic, which was one of our early customers, doing evals. We did dangerous capability evals, nothing we published openly. But then we started thinking about doing some kind of public benchmark, and one thing we really started thinking about was running agents, and specifically agents managing businesses. This was early 2025, and I think that was when we saw the first mentions of people running one-person unicorns or even autonomous companies. So we thought, “Let’s make a benchmark of how well an agent can run probably the simplest business possible,” and that’s probably running a vending machine. So that’s the first public one we did. And almost no one noticed it in the first couple of months, I think. We released it in February last year, and then I think around Easter last year we got the first viral tweet about it, which someone else wrote.
Lukas [00:02:11]: We tweeted a bunch when it came out and tried our best.
Axel [00:02:15]: We tried.
Vibhu [00:02:16]: It’s the one at Anthropic, right?
Lukas [00:02:18]: So this
Swyx [00:02:19]: This is a classic thing we should get out of the way.
Lukas [00:02:20]: Exactly. There’s two versions.
Swyx [00:02:22]: Everyone does this. Yes.
Lukas [00:02:23]: There’s Vending Bench, which is the simulated one, which we did completely independently in February. And then, like Axel said, that was the thing that didn’t get any traction in the beginning, but then some random person made a tweet about it, and that
Axel [00:02:38]: You have the paper
Lukas [00:02:38]: That is the paper. Correct, yeah. And then we thought this was very fun. I think this is also one thing with Andon Labs: the way we decide what to do next and what projects to do, the heuristic we use is, what is fun? What would be a fun project? And doing this in real life sounded quite fun for us, and maybe also scientifically useful. So we basically had this idea, but then we needed a place for it, and putting it out in public would probably not really work. It would get vandalized and stuff. So we pitched it to the people we were already working with at Anthropic, and they were like, “Yeah, you can have space. This sounds fun.”
Swyx [00:03:21]: It’s like a small fridge, right? It’s like a mini fridge.
Axel [00:03:23]: Absolutely.
Swyx [00:03:24]: People-- There’s like a Stripe thing or like an
Vibhu [00:03:27]: Oh, okay. So it was very OG, the early days
Lukas [00:03:28]: That’s the OG one. Yeah
Vibhu [00:03:29]: iPad on this. We saw it in June, like two months after it had been there. They upgraded a little bit. There’s a security camera for making sure you actually Venmo the thing.
Swyx [00:03:40]: So, my impression, okay, we’re going straight into Project Vend because it’s such an iconic thing. I do want to cover a little bit of the origin story even before Project Vend and even into Vending Bench. I think a lot of people are like yourselves: smart, interested in the future of AI, interested in developing evals. But how the hell do you just walk into Anthropic’s doors and work with them, right? What are they looking for? What works? And then maybe, when you launch, I always think, obviously it would be better to launch with a lab, but sometimes
Vibhu [00:04:12]: It’s harder to do than it seems.
Swyx [00:04:13]: Exactly. So either of those, which are more sort of beginner questions, but I think it’s meaningful advice to others.
Lukas [00:04:21]: We get this question a lot, and I don’t think our experience is maybe the best. But the way we did it was that we just built a bunch of things that we had conviction would be useful, and then we just set up a server and sent it to them to use for free. And then after a while they were like, “Oh, yeah, this is actually kind of useful. We should probably pay for this.” But that took a while. I don’t know if this is the best path to doing it, but that’s how it went for us.
Axel [00:04:47]: I think maybe generally, everyone is interested in good evals, and especially evals that don’t saturate that easily. So if you can build an eval that tests something novel, something useful, and you have good separation of models, meaning the more advanced models rank higher than the worse models, then you can publish it and try to get some traction, which is sort of how Vending Bench got attention. And then probably some lab will be interested, or you at least have something to reach out with.
Why Dollar-Based Evals Matter
Swyx [00:05:21]: I think you’re in one of the few categories of evals that correlate to real money. SWE-Lancer was also last year, right? Where people solve actual Upwork tasks. Was it Upwork or other tasks? Something like that. It’s a dollar value, right? Forget your Elo scores. Forget your
Axel [00:05:37]: Percentiles
Swyx [00:05:38]: Zero to one hundred percents. Just go straight for dollars and, that’s AGI.
Lukas [00:05:43]: And there’s like-- I think the nice thing is that there’s no ceiling. You can just-- It never saturates because it could just make more and more mo
[truncated for AI cost control]