Show HN: A local rig to test if AI social simulation predicts reality
A developer created a local rig to test whether multi-agent social simulations (like MiroFish) actually predict public reactions better than a single LLM. Preliminary results with a small model and synthetic cases show that a single LLM ties or beats a crude swarm simulation, and the aggregate 'magic' signals are noise. The rig is open-source and runs on Ollama, highlighting the need for proper calibration in the simulation category.
Multi-agent "social simulation" engines (à la MiroFish — 16k★, OASIS/CAMEL-AI) promise: feed in a document, spawn hundreds of AI personas, and predict how the public will react — before you ship. The category is hot and well-funded.
One problem: nobody publishes the calibration. The demos show one impressive run on one case and say "look, it predicted!". Does the simulation actually beat just asking a single LLM? Nobody measures it.
This is a small, honest rig that measures it. Runs 100% locally on Ollama (sovereign, no cloud).
⚠️ Read the limitations before the findings. This is a rehearsal, not a verdict. See below.
TL;DR (preliminary — n=5 synthetic cases, local qwen2.5:7b)
On what people will say (sentiment direction): a single LLM ties a crude multi-agent swarm. Both mediocre on hard cases (~60%).
On which objections will surface: a single LLM wins clearly (recall ~98% vs ~70%).
On the aggregate "magic" signals (virality magnitude, polarization), the things simulation is supposed to be good at, the numbers are noise at this scale. Spearman ρ swings between runs and sometimes flips sign (+0.71 ↔ −0.71; +0.82 ↔ +0.10). At n=5, ρ≈±0.7 isn't even significant.
Adding an agent-interaction round (the core MiroFish thesis) did not help in this crude form.
Conclusion: at small scale the "predictive magic" is indistinguishable from a coin flip. That doesn't disprove MiroFish — it shifts the burden of proof onto the category, and gives you a rig to actually test it instead of trusting a demo.
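The significance point can be checked by brute force. The sketch below is not part of the rig; it assumes five cases with no tied ranks and enumerates all 120 possible rankings to get an exact two-sided p-value for a given Spearman ρ. The function names are illustrative.

```python
from fractions import Fraction
from itertools import permutations


def spearman_rho(perm):
    """Spearman rho between ranks 1..n and a permutation of them (no ties)."""
    n = len(perm)
    d2 = sum((i - p) ** 2 for i, p in enumerate(perm, start=1))
    return 1 - Fraction(6 * d2, n * (n * n - 1))


def exact_two_sided_p(observed_rho, n):
    """Share of all n! rankings whose |rho| is at least |observed_rho|."""
    # Snap the float to a nearby simple fraction so comparisons are exact.
    threshold = abs(Fraction(observed_rho).limit_denominator(1000))
    rhos = [spearman_rho(p) for p in permutations(range(1, n + 1))]
    hits = sum(1 for r in rhos if abs(r) >= threshold)
    return Fraction(hits, len(rhos))


if __name__ == "__main__":
    for rho in (0.7, 0.82, 1.0):
        p = exact_two_sided_p(rho, n=5)
        print(f"n=5  |rho| >= {rho:.2f}  two-sided p = {float(p):.3f}")
```

Under these assumptions only a perfect ranking (|ρ| = 1) falls below p = 0.05 at n=5, so a value near ±0.7 is well within what chance alone produces.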
Headline result (5× averaged, local qwen2.5:7b)
| Predictor | Sentiment direction | Objection recall | Objection precision | Magnitude (rank ρ) | Polarization (rank ρ) |
|---|---|---|---|---|---|
| mini_swarm (no interaction) | 64% | 71% | 62% | +0.10 | −0.47 |
| single_llm (one zero-shot call) | 52% | 84% | 71% | +0.22 | +0.05 |
| dumb (always "mixed") | 40% | 0% | 0% | n/a | n/a |
The single LLM is the bar to beat. A crude swarm doesn't.
⚠️ Limitations (front and center — this is the whole point)
n=5, and the cases are synthetic (hand-written, illustrative). This is a methodology rehearsal, not evidence about the real world.
The swarm here is a crude proxy, NOT MiroFish. Real MiroFish has many more agents and richer interaction dynamics. This rig tests naive persona-averaging and a toy interaction round — it does not (yet) test real MiroFish.
One small local model (qwen2.5:7b). A bigger/different model may change everything.
5-point rank correlations are not statistically meaningful. Treat magnitude/polarization here as noise illustration, not signal.
→ To get a real answer you need: dozens of real cases with documented ground truth, multiple seeds, and the actual MiroFish engine. That's the open work.
How it works
Cases (cases/*.yaml): a stimulus plus its known reaction (ground truth). The bundled cases are synthetic (see Limitations).
Predictors (interchangeable): mirofish (the real sim — adapter stub to implement), mini_swarm / swarm_x (crude swarm, no/with interaction), single_llm (the baseline to beat), dumb (sanity).
Metrics: sentiment direction, objection recall/precision (semantic LLM-judge), magnitude & polarization rank correlation.
Report: honest comparison, with --runs N to average away run-to-run noise.
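To make the metrics concrete, here is a minimal sketch of how sentiment-direction accuracy, objection recall/precision, and run averaging could be computed. It is not the harness code: every function name is hypothetical, the example data is invented, and the rig's semantic LLM judge is replaced by a plain case-insensitive string match so the example runs offline.

```python
from statistics import mean


def exact_match(predicted, actual):
    """Stand-in judge: case-insensitive equality (the rig uses an LLM judge)."""
    return predicted.strip().lower() == actual.strip().lower()


def objection_scores(predicted, actual, judge=exact_match):
    """Return (recall, precision) of predicted objections vs. ground truth."""
    predicted, actual = list(predicted), list(actual)
    # Recall: how many real objections were anticipated by some prediction.
    found = sum(1 for a in actual if any(judge(p, a) for p in predicted))
    # Precision: how many predictions correspond to some real objection.
    useful = sum(1 for p in predicted if any(judge(p, a) for a in actual))
    recall = found / len(actual) if actual else 0.0
    precision = useful / len(predicted) if predicted else 0.0
    return recall, precision


def sentiment_accuracy(predicted_labels, true_labels):
    """Fraction of cases where the predicted sentiment direction is right."""
    pairs = list(zip(predicted_labels, true_labels))
    return sum(p == t for p, t in pairs) / len(pairs)


def average_over_runs(run_once, runs):
    """Call a zero-argument scoring function `runs` times and average it."""
    return mean(run_once() for _ in range(runs))


if __name__ == "__main__":
    # Invented example data, not taken from the rig's cases.
    predicted = ["too expensive", "privacy risk", "vendor lock-in"]
    actual = ["Privacy risk", "too expensive", "breaks workflows", "no offline mode"]
    print(objection_scores(predicted, actual))

    truth = ["negative", "mixed", "positive", "mixed", "negative"]
    always_mixed = ["mixed"] * len(truth)
    print(sentiment_accuracy(always_mixed, truth))
```

Passing the judge in as a function keeps the scoring logic independent of how "same objection" is decided, so a semantic matcher can replace `exact_match` without touching the arithmetic.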
Quick start (local, Ollama)
    pip install -r requirements.txt   # or: python -m venv .venv && .venv/bin/pip install -r requirements.txt
    cp .env.example .env              # points at local Ollama by default
    ollama pull qwen2.5:7b

    python run.py --predictors single_llm,dumb                         # baselines, fast
    python run.py --predictors swarm_x,mini_swarm,single_llm --runs 5  # the real comparison
Open questions / contributing
This rig is only as good as its cases and its sim adapter. PRs very welcome:
Add real cases with documented ground truth (cases/case_01_template.yaml). Prefer post-cutoff events (else the LLM remembers instead of predicting).
Implement the MiroFish adapter (harness/adapters/mirofish.py) — the one real integration that turns this into a verdict on the actual engine.
Run at N≥30 with multiple seeds and report whether the aggregate signals survive the noise floor.
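One way to pin down the "noise floor" for the rank-correlation metrics is to simulate it. The sketch below is an assumption about how that could be done, not code from the rig. It draws random rankings for a given number of cases and reports the |ρ| that chance alone exceeds only 5% of the time; an aggregate signal would need to clear that bar consistently across seeds. Tied ranks are ignored for simplicity.

```python
import random


def null_rho(n, rng):
    """Spearman rho between ranks 0..n-1 and a uniformly random reshuffle."""
    perm = list(range(n))
    rng.shuffle(perm)
    d2 = sum((i - p) ** 2 for i, p in enumerate(perm))
    return 1 - 6 * d2 / (n * (n * n - 1))


def noise_floor(n, trials=5000, quantile=0.95, seed=0):
    """|rho| that pure-chance rankings stay below in `quantile` of trials."""
    rng = random.Random(seed)
    values = sorted(abs(null_rho(n, rng)) for _ in range(trials))
    return values[int(quantile * (trials - 1))]


if __name__ == "__main__":
    for n in (5, 10, 30, 100):
        print(f"n={n:>3}  95% noise floor for |rho| ~ {noise_floor(n):.2f}")
```

The floor shrinks as the number of cases grows (for larger n it is roughly 1.96/√(n−1)), which is why the same ρ that means nothing at n=5 can become informative at N≥30.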
Credit
Built to stress-test the premise behind MiroFish and the OASIS / CAMEL-AI line of work. Huge respect to those projects — this rig exists to help the category prove itself, with method instead of demos.
Why I built this
I'm an infra/DevOps engineer who builds real agentic systems. The agentic-AI space is full of impressive demos and thin on measurement. I'd rather ship a rig that tells the uncomfortable truth than a demo that flatters it. Proof, not claims.
MIT licensed.