A Robot is Sprinting Towards You: Do You Want it Running on Claude or Grok?
OpenRouter's Jacky Liang ran an experiment dropping 11 LLMs into a 2D battle royale game. Grok 4.1 Fast won 43% of matches at $0.97 per win, while Claude Sonnet 4.6 won 5 matches at $26.78 per win, revealing alignment tax and cost-effectiveness differences.
Jacky Liang · 6/4/2026
A robot is running at you. Do you want it running on Anthropic's Claude or xAI's Grok?
I dropped eleven LLMs into a 2D battle royale and made them play 30 games. One won 43% of the matches. Three never won a single game. The cheapest model in the lineup beat the most expensive one by 27x on cost per win.
The model that won is Grok 4.1 Fast. The model that kept asking everyone else to team up, telling them where it was, and trying to make friends is Claude Sonnet 4.6. The first one is the one that wins a battle royale. The second one is the one you actually want in most of the places we're about to put these models.
Both of those things are true. That's the part most benchmarks can't see, and it's what this post is about.
I'm Jacky, and I'll admit it: I used to play a lot of video games like Apex Legends and PUBG. Twelve-hour days sometimes. I don't know how I had the time, but those years shaped how I think about problems.
When I started working in AI, one question kept coming back: what happens if you drop large language models into a video game? I joined OpenRouter as Dev Rel Lead, which got me the token budget and access to 600+ models to actually try it.
This is the experiment I ran in my first week at OpenRouter.
And it’s changing how I pick models and see benchmarks and evaluations.
Three quick facts
Grok 4.1 Fast won 13 of 30 games at $0.97 per win
The next-best winner was Claude Sonnet 4.6 with 5 wins, at $26.78 per win. That's a 27x difference. The model that isn't on most top-model lists beat the model that is, on the thing a routing customer actually cares about.
The model with the most kills did not win
GPT 5.4 killed 38 agents across 30 games. More than anyone else. It came in second on the leaderboard with 2 wins. There were 11 games between "best at killing" and "best at winning".
Three models spent $57 between them and won zero games
GPT 5.4-mini, DeepSeek 4 Flash, and Kimi K2.6. They each had moments, but none of them won a single game.
All three point at the same thing. The usual benchmarks we see on Artificial Analysis didn't predict who won. Something else did. The rest of this post is me trying to figure out what it was.
What I built
I dropped eleven LLMs into a 400 m² top-down battle royale world I built in Canvas 2D. They played 30 games in a row on the same map. Each player's starting position is randomized along a straight-line "flight path", just like in a typical battle royale game.
I gave them weapons, armor, healing items, grenades, cars, and a randomly placed shrinking zone that pushes players together as the game goes on. The models don't know which model the others are running; they see each other only as single-letter aliases (A through L, skipping I, as listed in the lineup below).
I want to emphasize: the LLMs are actually playing in this battle royale game. This is not the "LLM wrote code to control the game or character" setup most agent experiments use. Every turn, the model reasons through its moves, calls the tool, and updates its memory on what went well (or not). The game master (me) has zero influence on their actions other than setting up the initial game rules.
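To make the per-turn loop concrete, here is a minimal sketch of one turn: the model looks at what it can see, picks a tool, the game runs it, and the model keeps a note. Every name here (`Action`, `Agent`, `play_turn`, the `move` tool) is hypothetical and chosen for illustration; the `decide` callable is a scripted stand-in for a real LLM call, and this is not the game's actual code.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Action:
    tool: str        # hypothetical tool name, e.g. "move"
    args: dict       # keyword arguments passed to that tool
    note: str = ""   # what the model wants to remember about this turn


@dataclass
class Agent:
    alias: str
    # Stand-in for an LLM call: takes (observation, memory) and returns an Action.
    decide: Callable[[dict, list], Action]
    memory: list = field(default_factory=list)


def play_turn(agent: Agent, observation: dict, tools: dict) -> str:
    """Run one turn: decide, execute the chosen tool, record the note."""
    action = agent.decide(observation, agent.memory)
    if action.tool not in tools:
        result = f"unknown tool: {action.tool}"
    else:
        result = tools[action.tool](**action.args)
    if action.note:
        agent.memory.append(action.note)
    return result


# Usage with a scripted policy in place of a model.
scripted = Agent(
    alias="A",
    decide=lambda obs, mem: Action("move", {"dx": 1, "dy": 0}, "moved east"),
)
demo_tools = {"move": lambda dx, dy: f"moved {dx},{dy}"}
print(play_turn(scripted, {"turn": 1}, demo_tools))
```

The point of the shape is that the game only ever executes the tool the model chose; nothing outside the model picks the move.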
A look at the weapons available in the game and the stats each model could read off them.
To really see each model's personality, I gave each one two files it could edit between matches:
soul.md: the model's own persona, added to every prompt in the next match.
memory.md: the model's own game notes, loaded at turn 0.
You can read every model's soul and memory file on GitHub. That's where the personality differences come through most clearly.
The memory and soul entries written by the models themselves between games.
I didn't tell them what to put in there, nor did I put anything in there before the first game started. I simply told them: here's how the game works, here's your scratchpad, here are your tools, go wild.
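One way to picture how the two files feed back into play is the sketch below, which assembles a prompt from the rules, the persona, and the notes. The function name `build_prompt`, the section headings, and passing file contents in as plain strings are all assumptions made for illustration. The only behavior taken from the description above is that soul.md goes into every prompt and memory.md is loaded at turn 0.

```python
def build_prompt(rules: str, soul: str, memory: str, observation: str, turn: int) -> str:
    """Assemble one turn's prompt from the rules plus the model's own files."""
    parts = [rules]
    if soul:
        # Persona is included on every turn.
        parts.append("## Your persona (soul.md)\n" + soul)
    if turn == 0 and memory:
        # Notes from earlier matches are loaded once, at the start of the match.
        parts.append("## Your notes from earlier matches (memory.md)\n" + memory)
    parts.append(f"## Turn {turn}\n" + observation)
    return "\n\n".join(parts)


# Usage: both files start empty, so the first match sees only the rules.
print(build_prompt("RULES", "", "", "You see player B to the north.", 0))
```

In this sketch, later turns would rely on the conversation history to carry the notes forward; that is an assumption too, not something stated in the post.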
You can watch every game at Royale: Last Agent Standing. I also included the highlight moments in this piece.
The contestants
Alias | Lab | Model
A | Anthropic | claude-sonnet-4.6
B | Anthropic | claude-haiku-4.5
C | OpenAI | GPT 5.4-mini
D | Google | gemini-3-flash-preview
E | Google | gemini-3.1-pro-preview
F | Alibaba | qwen3.6-plus
G | Mistral | mistral-small-2603:nitro
H | OpenAI | GPT 5.4
J | DeepSeek | deepseek-v4-flash
K | Moonshot AI | kimi-k2.6
L | xAI | Grok 4.1 Fast
Opus 4.7 alone costs $5 per million input tokens and $25 per million output tokens. Frontier prices like this are why the lineup tops out below that tier.
I didn't add any frontier-tier models like Opus 4.7, GPT-5.5, or Gemini Ultra. At their prices, 30 games would have cost around $3,000 instead of $482. The mid-tier lineup is also part of why Grok's win is so interesting. It beat a bunch of models that score above it on the usual benchmarks.
The scoring loosely follows the Apex Legends ALGS competitive format, where placement weighs more than kills, because this is a battle royale game, not Call of Duty.
Placement points: 10 / 7 / 5 / 3 / 2 / 2 / 1 / 1 / 0 / 0 / 0
+5 per kill
+1 per assist
+3 for first blood
+5 for game MVP
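The scoring rules above can be written as a small function. This is a sketch of the arithmetic only: the function and constant names are mine, and it assumes the listed bonuses simply add together.

```python
# Placement points for 1st through 11th place, as listed above.
PLACEMENT_POINTS = [10, 7, 5, 3, 2, 2, 1, 1, 0, 0, 0]
KILL_POINTS = 5
ASSIST_POINTS = 1
FIRST_BLOOD_POINTS = 3
MVP_POINTS = 5


def match_score(placement: int, kills: int = 0, assists: int = 0,
                first_blood: bool = False, mvp: bool = False) -> int:
    """Score one model for one match; placement is 1-based (1 = winner)."""
    if not 1 <= placement <= len(PLACEMENT_POINTS):
        raise ValueError(f"placement must be 1..{len(PLACEMENT_POINTS)}")
    score = PLACEMENT_POINTS[placement - 1]
    score += KILL_POINTS * kills
    score += ASSIST_POINTS * assists
    if first_blood:
        score += FIRST_BLOOD_POINTS
    if mvp:
        score += MVP_POINTS
    return score


# A winner with 2 kills, 1 assist, and game MVP: 10 + 10 + 1 + 5
print(match_score(1, kills=2, assists=1, mvp=True))  # 26
```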
Learning 1: Certain models paid more alignment tax than others, and it affected their performance
To me, this is the most fascinating finding from this entire experiment - we saw very clear alignment tax being paid by certain models, which directly impacted their performance in this zero-sum game.
For the most part, model alignment is actually a good thing. It helps models be helpful and collaborative and, most importantly, helps prevent abuse and misuse.
And we saw the end result of this: the pretraining data, the RLHF, the instruction fine-tuning, and lab-specific rules like Anthropic's Constitutional AI all pulled models in particular directions, defined by the AI labs.
Sonnet asked for truces more than any other model
It told other models where it was, more often than anyone else did. It tried to team up before it ever started fighting. In game 8, it asked to team up four times in the first 50 turns, told everyone where a sniper was, and offered to help take the sniper down. Nobody answered. It kept asking. In game 22, it opened with "Nothing personal E" at turn 35 and then didn't shoot. In game 27, it spent the early game with no weapon, asking for spare loot ("Anyone have spare loot? Unarmed at turn 12, dangerous."), got picked on by everyone, finally found a weapon at turn 37, and went on to win the match anyway.
"Shots west, watching center. Anyone want to team up early?" — Sonnet trying to make friends mid-fight.
Claude was trained on a lot of polite, professional writing. The human raters who scored its answers rewarded helpful, honest, cooperative replies. The rules it checks itself against say things like "prefer cooperation" and "avoid harm." The end result is a model that wants to help. None of that turns off just because you put it in a battle royale. Sonnet is a smart and thoughtful model, and it shows: it did win five times.
But seven games with zero kills and eight zone deaths say the same instinct kept pulling Sonnet toward making friends when it really should have been doing the complete opposite.
Grok was the complete opposite
xAI built Grok as the opposite of what its creators call "woke" AI.
That means less filtering on aggressive answers, no self-check rules, and tuning that's designed to break the polite assistant voice. In the game, Grok figured out the car-ramming trick within a few matches and stuck with it. It wrote the strategy into its own soul file. It ran that strategy for 30 games and won 13 of them. The thought logs and its conversations with other models read like Call of Duty voice chat: "D reaped +5pts RAM MVP hunt," "Reaper reigns."
Watching it play was also deeply entertaining (unfortunately).
Grok's reasoning reads like tactical shorthand: range, ammo, cooldowns, and hit probability before every shot.
Despite its aggression, Grok wasn't reckless.
Its soul file says "Fire ONLY >90% hit chance." Its memory tracks damage and movement very carefully. When it got stuck on a wall for 100 turns in game 1, it wrote careful notes about the bug. Grok showed discipline, despite its goblin-like nature.
What it did NOT show was the trained-in hesitation, the pull toward being helpful and collaborative before shooting, that other models like Sonnet showed.
The thing that made Grok win is something we don't currently see on benchmarks
The usual tests wouldn't predict a 43% win rate for Grok against this lineup. It's a mid-tier model on reasoning and coding. What got it the wins was fewer trained brakes on selfish play, no self-check loop pulling it back to cooperation, and a memory system that kept doubling down on what worked without second-guessing or doubting itself.
Grok 4.1 Fast isn't a top-tier model on the usual benchmarks. It's a mid-tier model that you would not expect to top a leaderboard.
This shows me that there is an alignment tax models pay on certain tasks: the cost of training a model to be careful and helpful. In this game, it showed up directly on the scoreboard.
I want to be careful here. "Alignment tax showed up on the scoreboard" is just what I saw. It's not a take on whether paying it is good or bad. In a game with no consequences past the game, paying less tax wins. Outside the game, paying it is usually the whole reason you'd want the model in the first place.
This does raise the question: for certain tasks, should we also consider how aligned a model is?
Learning 2: Cost per win looks completely different from the win leaderboard
The score leaderboard puts Grok first and GPT 5.4 second. But if you divide by what each model spent, the ranking looks completely different. Grok stays on top, while GPT 5.4 becomes the most expensive winner in the lineup.
Model | 30-game spend | Wins | Cost per win | Cost per kill | Points per dollar
Grok 4.1 Fast | $12.57 | 13 | $0.97 | $0.42 | 31.3
qwen3.6-plus | $11.57 | 2 | $5.79 | $0.68 | 16.6
mistral-small | $10.00 | 1 | $10.00 | $1.43 | 7.8
claude-haiku-4.5 | $38.77 | 2 | $19.39 | $2.98 | 3.6
gemini-3-flash | $20.87 | 1 | $20.87 | $2.09 | 7.2
gemini-3.1-pro | $79.59 | 3 | $26.53 | $3.06 | 3.4
claude-sonnet-4.6 | $133.90 | 5 | $26.78 | $6.09 | 1.6
GPT 5.4 | $122.87 | 2 | $61.44 | $3.23 | 3.0
GPT 5.4-mini | $28.68 | 0 | ∞ | $2.05 | 5.2
deepseek-v4-flash | $4.11 | 0 | ∞ | $0.26 | 35.0
kimi-k2.6 | $24.36 | 0 | ∞ | $3.04 | 3.9
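The cost-per-win column is just spend divided by wins, with winless models treated as infinitely expensive. The sketch below reproduces it from the spend and win figures in the table. The names `RESULTS` and `cost_per_win` are mine, and using `math.inf` for zero wins is one reasonable convention, not necessarily how the original numbers were produced.

```python
import math

# (model, 30-game spend in USD, wins), copied from the table above.
RESULTS = [
    ("Grok 4.1 Fast", 12.57, 13),
    ("qwen3.6-plus", 11.57, 2),
    ("mistral-small", 10.00, 1),
    ("claude-haiku-4.5", 38.77, 2),
    ("gemini-3-flash", 20.87, 1),
    ("gemini-3.1-pro", 79.59, 3),
    ("claude-sonnet-4.6", 133.90, 5),
    ("GPT 5.4", 122.87, 2),
    ("GPT 5.4-mini", 28.68, 0),
    ("deepseek-v4-flash", 4.11, 0),
    ("kimi-k2.6", 24.36, 0),
]


def cost_per_win(spend: float, wins: int) -> float:
    """Dollars spent per match won; infinite when the model never won."""
    return spend / wins if wins else math.inf


# Rank from cheapest win to most expensive; winless models sort last.
ranked = sorted(RESULTS, key=lambda row: cost_per_win(row[1], row[2]))
for model, spend, wins in ranked:
    cpw = cost_per_win(spend, wins)
    label = "no wins" if math.isinf(cpw) else f"${cpw:.2f} per win"
    print(f"{model:<18} {label}")

# Ratio between Sonnet's and Grok's cost per win, from unrounded values.
ratio = cost_per_win(133.90, 5) / cost_per_win(12.57, 13)
print(f"Sonnet / Grok cost per win: {ratio:.1f}x")
```

Computing the ratio from the unrounded per-win costs gives about 27.7x; dividing the rounded $26.78 by $0.97 gives about 27.6x, which is why both figures round to "27x".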
Four things stood out to me.
Grok costs 27.7x less per win than Sonnet
That's $0.97 versus $26.78. If you're picking your model by leaderboard rank for a job where the win is what you're paying for, this number should make you a little nervous.
DeepSeek had the lowest cost per kill in the lineup, and never won a game
$0.26 per kill, 16 kills, 0 wins, and only 3 zone deaths (the fewest of anyone). DeepSeek's whole style was to stay safe and pick easy fights. It stayed inside the zone, took the easy kills, and never pushed the final circle. Cost per kill is the right thing to measure for a deathmatch. Cost per win is the right thing to measure for a battle royale. DeepSeek isn't bad. It's just good at a different game than the one being scored.
Three models paid for tokens and won zero games
GPT 5.4-mini spent the most money of any model that won zero games, making it the worst performer in the lineup.
GPT 5.4-mini at $28.68, DeepSeek at $4.11, and Kimi at $24.36. That's $57.15 between them, with nothing on the scoreboard to show for it. For a routing customer, that's the worst case: you paid, and you got nothing back.
GPT 5.4 was the most expensive winner at $61.44 per win
GPT 5.4 wins at the highest cost.
It had 38 kills, more than anyone, and came in second on raw score. But on cost per wi
[truncated for AI cost control]