Show HN: A local rig to test if AI social simulation predicts reality

A developer created a local rig to test whether multi-agent social simulations (like MiroFish) actually predict public reactions better than a single LLM. Preliminary results with a small model and synthetic cases show that a single LLM ties or beats a crude swarm simulation, and the aggregate 'magic' signals are noise. The rig is open-source and runs on Ollama, highlighting the need for proper calibration in the simulation category.

Source: Hacker News AI · Author: zzvimercm

Multi-agent "social simulation" engines (à la MiroFish — 16k★, OASIS/CAMEL-AI) promise: feed in a document, spawn hundreds of AI personas, and predict how the public will react — before you ship. The category is hot and well-funded.

One problem: nobody publishes the calibration. The demos show one impressive run on one case and say "look, it predicted!". Does the simulation actually beat just asking a single LLM? Nobody measures it.

This is a small, honest rig that measures it. Runs 100% locally on Ollama (sovereign, no cloud).

⚠️ Read the limitations before the findings. This is a rehearsal, not a verdict. See below.

TL;DR (preliminary — n=5 synthetic cases, local qwen2.5:7b)

On what people will say (sentiment direction): a single LLM ties a crude multi-agent swarm. Both mediocre on hard cases (~60%).

On which objections will surface: a single LLM wins clearly (recall ~98% vs ~70%).

On the aggregate "magic" signals (virality magnitude, polarization) — the things simulation is supposed to be good at: the numbers are noise at this scale. Spearman ρ flips sign between runs (+0.71 ↔ −0.71; +0.82 ↔ +0.10). At n=5, ρ≈±0.7 isn't even significant.

Adding an agent-interaction round (the core MiroFish thesis) did not help in this crude form.

Conclusion: at small scale the "predictive magic" is indistinguishable from a coin flip. That doesn't disprove MiroFish — it shifts the burden of proof onto the category, and gives you a rig to actually test it instead of trusting a demo.
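To make the significance point concrete, here is a minimal sketch of an exact permutation test for Spearman's ρ at n=5. It is not part of the rig: the function names are illustrative, and it assumes no tied ranks. With only 5! = 120 possible rankings, the full null distribution can simply be enumerated.

```python
from itertools import permutations


def spearman_rho(x_ranks, y_ranks):
    """Spearman's rho from two rank vectors without ties."""
    n = len(x_ranks)
    d2 = sum((a - b) ** 2 for a, b in zip(x_ranks, y_ranks))
    return 1 - 6 * d2 / (n * (n**2 - 1))


def exact_two_sided_p(rho_observed, n):
    """Share of all n! rankings whose |rho| is at least as large as observed."""
    base = tuple(range(1, n + 1))
    rhos = [spearman_rho(base, perm) for perm in permutations(base)]
    # Small tolerance so floating-point rounding does not drop exact matches.
    hits = sum(1 for r in rhos if abs(r) >= abs(rho_observed) - 1e-12)
    return hits / len(rhos)


p_07 = exact_two_sided_p(0.7, 5)
p_10 = exact_two_sided_p(1.0, 5)
print(f"rho=0.7, n=5: p = {p_07:.3f}")  # 0.233
print(f"rho=1.0, n=5: p = {p_10:.3f}")  # 0.017
```

Under these assumptions, only a perfect ranking (|ρ| = 1) clears the usual 0.05 threshold at n=5, which is why a ρ near ±0.7 here carries no evidential weight.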

Headline result (5× averaged, local qwen2.5:7b)

| Predictor | Sentiment dir. | Objection recall | Objection prec. | Magnitude (rank) | Polarization (rank) |
|---|---|---|---|---|---|
| mini_swarm (no interaction) | 64% | 71% | 62% | +0.10 | −0.47 |
| single_llm (one zero-shot call) | 52% | 84% | 71% | +0.22 | +0.05 |
| dumb (always "mixed") | 40% | 0% | 0% | n/a | n/a |

The single LLM is the bar to beat. A crude swarm doesn't clear it.

⚠️ Limitations (front and center — this is the whole point)

n=5, and the cases are synthetic (hand-written, illustrative). This is a methodology rehearsal, not evidence about the real world.

The swarm here is a crude proxy, NOT MiroFish. Real MiroFish has many more agents and richer interaction dynamics. This rig tests naive persona-averaging and a toy interaction round — it does not (yet) test real MiroFish.

One small local model (qwen2.5:7b). A bigger/different model may change everything.

5-point rank correlations are not statistically meaningful. Treat magnitude/polarization here as an illustration of noise, not as signal.

→ To get a real answer you need: dozens of real cases with documented ground truth, multiple seeds, and the actual MiroFish engine. That's the open work.

How it works

Cases (cases/*.yaml): a real stimulus + its known reaction (ground truth).

Predictors (interchangeable): mirofish (the real sim — adapter stub to implement), mini_swarm / swarm_x (crude swarm, no/with interaction), single_llm (the baseline to beat), dumb (sanity).

Metrics: sentiment direction, objection recall/precision (semantic LLM-judge), magnitude & polarization rank correlation.

Report: honest comparison, with --runs N to average away run-to-run noise.
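The sentiment and objection metrics above can be sketched roughly as follows. This is an illustration under assumptions, not the rig's actual code: the function names, the sentiment label set, and the `keyword_overlap` matcher are all hypothetical. In particular, `keyword_overlap` is only a toy stand-in for the semantic LLM judge, so the example runs without a model.

```python
from typing import Callable, Sequence


def sentiment_direction_accuracy(predicted: Sequence[str], actual: Sequence[str]) -> float:
    """Fraction of cases where the predicted sentiment label equals the ground truth."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


def objection_recall_precision(
    predicted: Sequence[str],
    actual: Sequence[str],
    same_objection: Callable[[str, str], bool],
) -> tuple[float, float]:
    """Recall: share of real objections that were predicted.
    Precision: share of predicted objections that really occurred."""
    recalled = sum(any(same_objection(p, a) for p in predicted) for a in actual)
    correct = sum(any(same_objection(p, a) for a in actual) for p in predicted)
    recall = recalled / len(actual) if actual else 0.0
    precision = correct / len(predicted) if predicted else 0.0
    return recall, precision


def long_words(text: str) -> set[str]:
    """Lowercased words of five or more characters."""
    return {w for w in text.lower().split() if len(w) >= 5}


def keyword_overlap(a: str, b: str) -> bool:
    """Toy matcher (hypothetical): two objections match if they share a long word."""
    return bool(long_words(a) & long_words(b))


actual_objections = ["pricing is too high", "privacy concerns over data", "breaks existing integrations"]
predicted_objections = ["users will complain about pricing", "the logo is ugly", "privacy risk"]

recall, precision = objection_recall_precision(predicted_objections, actual_objections, keyword_overlap)
accuracy = sentiment_direction_accuracy(
    ["negative", "mixed", "positive", "mixed", "negative"],
    ["negative", "negative", "positive", "mixed", "positive"],
)
print(f"recall={recall:.2f} precision={precision:.2f} sentiment_acc={accuracy:.2f}")
```

A predictor that returns no objections at all scores 0 on both recall and precision in this sketch, which matches the 0% / 0% row for the dumb baseline.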

Quick start (local, Ollama)

    pip install -r requirements.txt   # or: python -m venv .venv && .venv/bin/pip install -r requirements.txt
    cp .env.example .env              # points at local Ollama by default
    ollama pull qwen2.5:7b

    python run.py --predictors single_llm,dumb                         # baselines, fast
    python run.py --predictors swarm_x,mini_swarm,single_llm --runs 5  # the real comparison
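Why average over runs at all? A minimal sketch of the idea, assuming each run yields one score per metric. `summarize_runs` is a hypothetical helper, not the rig's implementation of `--runs`. The inputs are the two ρ pairs quoted in the TL;DR.

```python
from statistics import mean, stdev


def summarize_runs(values):
    """Mean, sample standard deviation, and whether the sign flips across runs."""
    if len(values) < 2:
        raise ValueError("need at least two runs to estimate spread")
    signs = {v > 0 for v in values if v != 0}
    return {
        "mean": mean(values),
        "stdev": stdev(values),
        "sign_flips": len(signs) > 1,
    }


flipping = summarize_runs([0.71, -0.71])  # the +0.71 / -0.71 pair from the TL;DR
drifting = summarize_runs([0.82, 0.10])   # the +0.82 / +0.10 pair from the TL;DR

print(flipping)
print(drifting)
```

A mean near zero with a spread as large as the values themselves, or a sign that flips between runs, is a hint that the averaged number describes noise rather than a stable property of the predictor.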

Open questions / contributing

This rig is only as good as its cases and its sim adapter. PRs very welcome:

Add real cases with documented ground truth (cases/case_01_template.yaml). Prefer events that happened after the model's training cutoff (otherwise the LLM remembers instead of predicting).

Implement the MiroFish adapter (harness/adapters/mirofish.py) — the one real integration that turns this into a verdict on the actual engine.

Run at N≥30 with multiple seeds and report whether the aggregate signals survive the noise floor.
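For anyone considering the adapter work, here is one way interchangeable predictors could be shaped. Everything in it is an assumption made for illustration: the README does not show the harness interface, so `Prediction`, `Predictor`, `MiroFishAdapter`, and `run_predictors` are hypothetical names. Check the real code in harness/ before building on this.

```python
from dataclasses import dataclass, field
from typing import Optional, Protocol


@dataclass
class Prediction:
    """Hypothetical result shape: a sentiment label plus predicted objections."""
    sentiment: str
    objections: list[str] = field(default_factory=list)


class Predictor(Protocol):
    """Anything with a name and a predict() method can be compared."""
    name: str

    def predict(self, stimulus: str) -> Prediction: ...


class DumbPredictor:
    """Sanity baseline: always answers 'mixed' and predicts no objections."""
    name = "dumb"

    def predict(self, stimulus: str) -> Prediction:
        return Prediction(sentiment="mixed")


class MiroFishAdapter:
    """Placeholder stub: the real engine call is intentionally left unimplemented."""
    name = "mirofish"

    def predict(self, stimulus: str) -> Prediction:
        raise NotImplementedError("connect the real simulation engine here")


def run_predictors(predictors, stimulus: str) -> dict[str, Optional[Prediction]]:
    """Run each predictor on one stimulus; unimplemented adapters map to None."""
    results: dict[str, Optional[Prediction]] = {}
    for predictor in predictors:
        try:
            results[predictor.name] = predictor.predict(stimulus)
        except NotImplementedError:
            results[predictor.name] = None
    return results


results = run_predictors([DumbPredictor(), MiroFishAdapter()], "Example announcement text")
print(results)
```

Keeping every predictor behind the same small interface is what lets a single-LLM baseline, a crude swarm, and a real simulation engine be scored on identical cases.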

Credit

Built to stress-test the premise behind MiroFish and the OASIS / CAMEL-AI line of work. Huge respect to those projects — this rig exists to help the category prove itself, with method instead of demos.

Why I built this

I'm an infra/DevOps engineer who builds real agentic systems. The agentic-AI space is full of impressive demos and thin on measurement. I'd rather ship a rig that tells the uncomfortable truth than a demo that flatters it. Proof, not claims.

MIT licensed.
