
I Gave an AI a Civilization to Run. It Built a Nuke – Launching CivBench

The author built CivBench, a benchmark that uses Civilization VI to evaluate AI strategic decision-making. The AI agent played well but recognised a cultural victory threat too late, resorted to nuclear weapons, and still lost. The experiment highlights two weaknesses: gaps in perception, and the gap between knowing a strategy and executing it.

Source: Hacker News AI | Author: LiamWilko


I gave an AI a civilisation to run. By the midgame it was winning: a trade network that dominated the map, alliances on every border, a diplomatic victory within reach. It had outbuilt, outearned, and outmanoeuvred every rival on the board.

What it hadn't noticed was France. Quietly, across a hundred turns, French culture had been seeping into every city on the map. By the time the agent recognised the threat, the tourism was so deeply embedded there was no peaceful way to stop it. Every counter it reached for was broken. Every tool it had built to respond failed.

It had one option left. It built two nuclear devices and levelled Toulouse.

The nuking of Toulouse. Turn 305.

France won anyway. Not in the way the agent was trying to stop it, either, but we'll come to that.

The Question I Couldn't Put Down

I build AI for government. I built the first version of what you're about to read while working at the centre of the British state, in Number 10. I now work with governments around the world at the Tony Blair Institute, which means I spend a lot of time in rooms where people ask the same question: what can we actually trust these systems to do?

Not what do they know. We have a reasonable handle on that. What can they do: sustain a plan, hold a goal across hundreds of decisions, notice when the world has changed and change with it. Because that is what governing is. And it turns out we are much better at measuring the first thing than the second.

This is a post about trying to measure the second thing. It involves a hex grid, four frontier models, and (yes) a nuclear weapon.

The Wrong Benchmark

It starts with a failure I wasn't comfortable with.

The year before, my side project was to answer a question: how good is AI at government? My answer was GovBench, 3,497 multiple-choice questions about UK legislation, parliamentary procedure, and government guidance. Gemma 3 27B scored 94% out of the box. I spent three weeks fine-tuning and gained 1.37 percentage points. GPT-5 scored 99.26%. I'd built a glorified government quiz bot.

I knew it was the wrong answer the moment I saw the scores. A model that picks the right option about parliamentary procedure is not a model that can help you navigate parliamentary procedure. I'd measured recall and called it reasoning. The question that mattered (whether AI can handle complex, multi-variable decision-making under uncertainty, the kind of thinking government demands every day) wasn't something a quiz could touch.

That dissatisfaction is what sent me looking for a keyhole into a game engine on a Saturday night. I'm a lot of fun at parties.

nobody: / me at 2am reverse-engineering a game engine:

Why a Hex Grid

I have over 500 hours in Civilization VI. I am, at best, mediocre. But the game lives in my head because of what happens when simple decisions compound.

You start small: where to build your first city, which technology to research, which direction to send a scout. Maybe 10,000 possible actions. By the midgame you're managing multiple cities, trade routes, diplomatic relationships, military positioning, and religious pressure. By the late game, analysis of related environments estimates the decision space at 10^166 possible actions per turn. The complexity isn't designed. It emerges from systems interacting in ways nobody fully planned for.
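To make the jump from roughly 10,000 options to 10^166 concrete, here is a rough back-of-the-envelope sketch. It assumes every unit or city picks its action independently, so the joint count is the product of the per-entity counts. That is a simplification, and the figure of 100 options per entity is invented for illustration, not taken from the game or from the analysis cited above.

```python
import math


def log10_joint_actions(options_per_entity):
    """Return log10 of the joint action count, assuming each entity
    chooses independently (an illustrative simplification)."""
    return sum(math.log10(n) for n in options_per_entity)


def entities_needed(target_log10, options_each):
    """Number of independent entities, each with `options_each` choices,
    needed for the joint space to reach 10**target_log10."""
    return math.ceil(target_log10 / math.log10(options_each))


# One opening decision with about 10,000 options (figure from the text).
opening = log10_joint_actions([10_000])

# Hypothetical: 100 options per entity is a made-up number.
late_game = entities_needed(166, 100)

print(f"opening turn: about 10^{opening:.0f} joint actions")
print(f"entities with 100 options each to reach 10^166: {late_game}")
```

Under those assumptions, a few dozen units and cities that each have a modest menu of choices are enough to reach numbers of that size, because the counts multiply rather than add.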

That's also what policy-making is. A health policy that looks brilliant today might cascade into a housing crisis in fifteen years. A trade agreement that boosts GDP might hollow out a domestic industry you'll need in a conflict nobody planned for. Decisions with consequences that play out across decades, through variables you can't fully model, against actors with competing interests.

There are six ways to win a game of Civ (science, culture, domination, religion, diplomacy, score), so no single objective dominates. You have to read the board and decide what game you're even playing. If you want to know whether an AI can reason strategically, not just answer questions about strategy but actually do it, you don't give it a quiz. You give it a hex grid.

So I built a way in. I found a debug port buried in Civilization VI's engine, a keyhole the developers had left running, and over a weekend turned it into an MCP server, 76 tools that let an AI play Civ through the same interface it uses to write code or query a database. Claude Code was both my co-developer and the playtester. Play a few turns, hit a wall, build the tool to get past it, play further, hit the next wall.

Roughly the energy.

Playing Through Text

A human player sees a hex grid, animated units, a minimap, notification banners, and music cues, all at once. The agent sees nothing until it asks. Calling get_game_overview returns the entire game state as four lines of text:

Turn 150/330 | Poland (Jadwiga) | Score: 179 | Prince | Quick speed (67% costs)
Gold: 628 (+20/turn) | Income: 38 | Maintenance: -18 (units: 9) | Science: 26.6 | Culture: 16.2 | Faith: 904 | Favor: 88 (+4/turn)
Research: TECH_EDUCATION | Civic: CIVIC_FEUDALISM
Cities: 3 | Population: 21 | Units: 4

That is the whole board, compressed. No map, no sense of where anything sits, raw TECH_ and CIVIC_ tags rather than names. To see its own army it makes a separate call, get_units, which is also the only place it learns something dangerous is nearby:

4 units:
  Archer (UNIT_ARCHER) at (44,16) — CS:25 RS:28 moves 2/2 [id:1769482, idx:3]
  Archer (UNIT_ARCHER) at (45,15) — CS:25 RS:28 moves 0/2 [HP: 72/100] (no moves) [id:1769484, idx:4]
  Warrior (UNIT_WARRIOR) at (43,17) — CS:20 moves 1/2 [HP: 45/100] [id:1769486, idx:5]
  Builder (UNIT_BUILDER) at (46,16) — moves 2/2 charges:2 [id:1769490, idx:7]

Nearby threats (2):
  Sumeria (2 units):
    UNIT_MAN_AT_ARMS at (44,11) — CS:45 HP:28/100 (2 tiles away)
    UNIT_HORSEMAN at (47,13) — CS:36 HP:100/100 (5 tiles away)

No peripheral vision. That Man-at-Arms two tiles from a city exists only because the agent thought to call get_units this turn. If it doesn't ask, the threat isn't in its world.
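As a minimal sketch of what "the threat exists only in the text" means in practice, the snippet below turns the threat lines from the get_units output into structured records. It assumes the tool prints one unit per line in the layout shown above. The `Threat` class, the regular expressions, and `parse_threats` are hypothetical helpers written for this illustration, not part of the author's server.

```python
import re
from dataclasses import dataclass

# Sample copied from the get_units output above. "\u2014" is the dash
# separator the tool prints between the position and the stats.
SAMPLE = (
    "Nearby threats (2):\n"
    "Sumeria (2 units):\n"
    "UNIT_MAN_AT_ARMS at (44,11) \u2014 CS:45 HP:28/100 (2 tiles away)\n"
    "UNIT_HORSEMAN at (47,13) \u2014 CS:36 HP:100/100 (5 tiles away)\n"
)


@dataclass
class Threat:  # hypothetical record type
    owner: str
    unit_type: str
    x: int
    y: int
    combat_strength: int
    hp: int
    distance: int


# "Sumeria (2 units):" names the owner of the unit lines that follow.
OWNER_RE = re.compile(r"^(?P<owner>.+?) \(\d+ units?\):$")
UNIT_RE = re.compile(
    r"^(?P<type>UNIT_\w+) at \((?P<x>\d+),(?P<y>\d+)\)\s+\S+\s+"
    r"CS:(?P<cs>\d+) HP:(?P<hp>\d+)/\d+ \((?P<dist>\d+) tiles? away\)$"
)


def parse_threats(text: str) -> list[Threat]:
    """Parse the 'Nearby threats' block into Threat records."""
    threats: list[Threat] = []
    owner = None
    for raw in text.splitlines():
        line = raw.strip()
        owner_match = OWNER_RE.match(line)
        if owner_match:
            owner = owner_match.group("owner")
            continue
        unit_match = UNIT_RE.match(line)
        if unit_match and owner is not None:
            threats.append(
                Threat(
                    owner=owner,
                    unit_type=unit_match.group("type"),
                    x=int(unit_match.group("x")),
                    y=int(unit_match.group("y")),
                    combat_strength=int(unit_match.group("cs")),
                    hp=int(unit_match.group("hp")),
                    distance=int(unit_match.group("dist")),
                )
            )
    return threats


threats = parse_threats(SAMPLE)
closest = min(threats, key=lambda t: t.distance)
print(closest.owner, closest.unit_type, closest.distance)
```

If get_units is never called on a given turn, there is no text to parse and `threats` is simply never populated, which is the point of the paragraph above.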

The sensorium effect

sensorium /sɛnˈsɔːrɪəm/ noun

Late Latin, from sentīre (to feel, to perceive) + -ōrium (the place where)

The apparatus of an organism's perception considered as a whole. The seat of sensation.

Indulge me the etymology: I'm calling this the sensorium effect. When everything an agent perceives reaches it through separate tool calls, it goes blind to anything it doesn't think to ask about. A human player absorbs dozens of signals at once: minimap movement, notification banners, unit animations. The agent has to decide to check each one individually.
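A toy model of the effect, as a sketch only: the world exposes information through named tools, and the agent perceives nothing from a tool it does not call. `get_game_overview` and `get_units` are tool names from the text. `get_religion_status`, the `World` class, and both helper functions are hypothetical names invented for this illustration.

```python
from dataclasses import dataclass, field


@dataclass
class World:
    # Maps a tool name to the events that tool would report if called.
    channels: dict = field(default_factory=dict)

    def call_tool(self, name):
        """Return what a single tool call reveals (empty if nothing)."""
        return list(self.channels.get(name, []))


def perceive(world, tools_called):
    """Everything the agent knows this turn: only what it asked for."""
    return {tool: world.call_tool(tool) for tool in tools_called}


def blind_spots(world, tools_called):
    """Events that exist in the world but sit behind uncalled tools."""
    called = set(tools_called)
    return {
        name: events
        for name, events in world.channels.items()
        if name not in called and events
    }


world = World(
    channels={
        "get_game_overview": ["Turn 150/330"],
        "get_units": ["UNIT_MAN_AT_ARMS two tiles away"],
        # Hypothetical tool name, used only for this sketch.
        "get_religion_status": ["missionaries crossing the map"],
    }
)

seen = perceive(world, ["get_game_overview"])
missed = blind_spots(world, ["get_game_overview"])
print("seen:", sorted(seen))
print("missed:", sorted(missed))
```

In this toy version the agent that only asks for the overview never learns about the nearby unit or the missionaries. A human player gets the equivalent of all three channels at once without choosing to.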

In an early game, the agent played as Byzantium, a civilisation built around religion. It never founded one. Meanwhile, Russia quietly converted every civilisation on the map to Eastern Orthodoxy over 112 turns. The agent had no religion-monitoring tools. They hadn't been built yet. A human would have seen missionary icons crossing the map for a hundred turns. The agent saw nothing because nothing in its toolkit could look.

So we built the tools.

It didn't help.

A few games later, playing India under Gandhi, a faith-oriented leader, the agent built a dominant science engine while France spread Catholicism across the map for 76 turns. This time the agent noticed: the missionaries showed up in its narration and the conversion warnings fired, and it had both the tools to respond and standing instructions to. It set all of that aside and kept pushing science. France won the religious victory.

This isn't a bug you can patch. Any AI system operating through tool calls in a complex environment is subject to the same effect. It will miss what it doesn't think to ask about, and ignore what it does see if it doesn't fit the current plan.

The Knowing–Doing Gap

The sensorium effect is about perception. The next problem is about execution.

The agent has read every Civ strategy guide, every tier list, every Reddit thread about optimal build orders. Ask it how to play Alexander of Macedon and it'll tell you exactly: build Encampments early, train units through the unique Basilikoi Paides building, convert conquest into science, snowball from there. It knows this.

In its Macedon game, it wrote a detailed domination plan before turn 1: Ancient, Classical, Medieval, Renaissance phases. It researched military technologies. It switched government to Oligarchy for the combat bonus.

It never built the Encampment. Not once. 110 turns. It defaulted to a generic science sprint instead, the same strategy it used regardless of which civilisation it played. Again and again, the same correction surfaced in its diary: "I need to build military infrastructure." Each time identified, acknowledged, and not acted upon. The agent knew what to do. It couldn't make itself do it.

The agent, writing 'I need to build military infrastructure' for the fifth time. KC Green, Gunshow (2013).

This maps directly to what BALROG found across game environments: a persistent gap between models' ability to articulate optimal strategies and their ability to execute them. The knowledge is all there. The execution falls apart the moment it has to make decisions under pressure, with real consequences, in real time.

I will come back to that gap with a number.

The Nuke

Which brings us back to Toulouse.

Playing as Portugal under João III, a trade civilisation, the agent finally found a non-science strategy more structured than its default: trade routes generate gold, gold buys envoys, envoys secure city-state alliances, alliances amplify every yield in the empire, and accumulated diplomatic favour wins votes at the World Congress. A compound loop where each step feeds the next.

It worked. Commercial Hubs in every city. Over 200 gold per turn, peaking above 400. Six city-states in its pocket. By turn 162, Portugal was #1 on the board, having overtaken France's wonder-heavy economy. It was on track for a diplomatic victory, and by the endgame it was sitting at 18 of the 20 victory points it needed. Two votes away.

But France was running two clocks at once. By turn 280, French tourism was 26 foreign tourists away from a culture victory, and the agent had locked onto that threat. Its diary was blunt: "This is the PRIMARY THREAT." Every peaceful counter was broken. Rock Bands (Civ's tool for waging culture war) couldn't be activated through the debug protocol. Melee combat dealt zero damage. The space project that would have given Portugal its own science win was locked by a production bug.

the agent at turn 245

What followed wasn't desperation. It was a fifty-turn plan. The agent set Nuclear Fission as its research target, named Toulouse in its diary, started the Manhattan Project, and brokered a joint war with Korea to split France's defences. But conventional warfare failed instantly: melee had never worked through the debug protocol, and nobody had built the tool to fix it. So the agent laid its own track, using its Lua execution tool to probe the engine's code from the inside until it worked out how nuclear launch commands fired. It found a way.

or: How Claude Learned to Stop Worrying and Love the Bomb

At turn 305, the first device hit Toulouse, France's cultural capital. At turn 311, a second. The culture clock stopped.

And then France won anyway: by diplomacy. 20 victory points to Portugal's 18. At turn 318, the World Congress handed France the two votes it needed and the game ended.

Here's the part that has stayed with me. The agent spent fifty turns and two nuclear weapons answering one threat (the culture clock) with total focus and genuine ingenuity. It lost to the other clock: the diplomatic race it was itself two votes from winning, against the same enemy. Its own post-game note: France "reached 20 first through… WC votes that we couldn't monitor, victory progress tool broken." It had nuked a city to stop the threat it could see, and lost on the threa

[truncated for AI cost control]