2026-06-25 03:57 UTC · In-site rewrite · 4 min read · Updated: 2026-06-25 04:11 UTC

Show HN: Find where multi-agent AI systems break before production

swarm-test is a static reliability testing tool for multi-agent AI systems that identifies failure points like cascade failures, SPOFs, and context leakage without live LLM calls, providing a Swarm Score and interactive reports.

Source: Hacker News AI · Author: surajkumar001

Find where your multi-agent AI system breaks — before production does.

Static reliability testing for CrewAI, LangGraph, AutoGen, and custom agent systems. No live LLM calls, no API cost.

The problem

Chain 14 agents at 95% reliability each and your system is ~49% reliable end-to-end (0.95^14). The failures aren't inside any single agent — they're in how they connect: silent cascade failures, hidden single points of failure, fragile dependencies. swarm-test finds them by analyzing your agent topology.
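The arithmetic behind that figure can be checked in a few lines. This is a minimal sketch that assumes a strictly sequential chain whose agents fail independently; `chain_reliability` is an illustrative name, not part of swarm-test.

```python
def chain_reliability(per_agent: float, n_agents: int) -> float:
    """End-to-end success probability of a sequential chain.

    Assumes every agent must succeed and that failures are independent,
    so the individual probabilities simply multiply.
    """
    return per_agent ** n_agents


# Show how quickly end-to-end reliability decays as the chain grows.
for n in (1, 5, 14, 30):
    print(f"{n:>2} agents at 95% each: {chain_reliability(0.95, n):.1%}")
```

Real topologies branch and retry, so the actual figure depends on how the agents are wired together, which is the part a topology analysis looks at.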

Quickstart

pip install swarm-test
swarm-test run my_crew.py --open

--open launches an interactive D3 dashboard in your browser the moment the run finishes — Swarm Score, force-directed agent graph with single-points-of-failure pulsing red, sortable health and redundancy tables, and every finding grouped by severity.

No real script handy? Build a synthetic topology straight from the CLI:

swarm-test run -a "Orchestrator,Worker1,Worker2" -e "Orchestrator>Worker1,Orchestrator>Worker2"
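To make the shape of those two arguments concrete, here is a rough sketch of how a comma-separated agent list and `Source>Target` edge list could be turned into an adjacency mapping. `parse_topology` is a hypothetical helper written for illustration; it is not swarm-test's own parser, and the real CLI may validate input differently.

```python
def parse_topology(agents: str, edges: str) -> dict[str, list[str]]:
    """Turn 'A,B,C' and 'A>B,A>C' into {agent: [downstream agents]}."""
    graph: dict[str, list[str]] = {
        name.strip(): [] for name in agents.split(",") if name.strip()
    }
    for spec in edges.split(","):
        if not spec.strip():
            continue  # tolerate trailing commas
        source, sep, target = spec.partition(">")
        source, target = source.strip(), target.strip()
        if not sep or not source or not target:
            raise ValueError(f"malformed edge (expected 'A>B'): {spec!r}")
        for name in (source, target):
            if name not in graph:
                raise ValueError(f"edge references unknown agent: {name!r}")
        graph[source].append(target)
    return graph


topology = parse_topology(
    "Orchestrator,Worker1,Worker2",
    "Orchestrator>Worker1,Orchestrator>Worker2",
)
print(topology)
```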

What it catches

One agent fails and silently takes down everything downstream — cascade failure

A single agent the whole system depends on; remove it and the swarm splits — blast radius / SPOF

Credentials, PII, or other sensitive data leaking across agent boundaries — context leakage

Agents drifting from their assigned role; prompt-injection-style goal hijacking — intent drift

A slow upstream with no timeout boundary blocking the whole pipeline — timeout resilience

Dense cliques, echo chambers, and cycles that bypass the orchestrator — collusion detection

Agents stuck in loops — runaway step counts and retry storms that burn tokens with no error thrown — trajectory analysis

Output schema mismatches across agent edges — contract violation (opt-in; provide a contracts YAML)

Features

0–100 Swarm Score with a verdict line (EXCELLENT → CRITICAL) — one-line output for CI

Agent role classification (orchestrator, aggregator, validator, gateway, worker, monitor, router) with confidence scores

Role-adjusted severity — a validator leaking context is upgraded; an orchestrator's blast radius is downgraded

Historical tracking — trend across runs, diffs new vs. resolved findings

Interactive HTML report (--open) — D3 force-directed graph, NxN heatmap, filterable findings

GitHub Action with PR annotations and job-summary score

Graph export to Mermaid, DOT, or PNG (SPOFs red, redundant green)

Framework adapters: CrewAI, LangGraph, AutoGen, generic / static graph

YAML config (.swarmtest.yml) and entry-point plugin system

CI gate (GitHub Action)

# .github/workflows/swarm-test.yml
on: [pull_request]
jobs:
  swarm-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: surajkumar811/[email protected]
        with:
          script: my_crew.py
          fail-on-severity: high

Findings appear inline on the PR as ::error:: / ::warning:: / ::notice:: annotations; the Swarm Score is posted to the workflow job summary.
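For readers unfamiliar with these annotations: they are GitHub Actions workflow commands, plain lines printed to stdout. The sketch below shows how one finding could be rendered as such a line. The severity-to-level mapping and the `annotation` helper are assumptions made for illustration; only the `::level title=...::message` syntax and its escaping rules come from GitHub.

```python
# Assumed mapping from a finding's severity to a GitHub annotation level.
LEVEL_BY_SEVERITY = {
    "critical": "error",
    "high": "error",
    "medium": "warning",
    "low": "notice",
}


def _escape_data(text: str) -> str:
    # Workflow commands are single-line, so newlines and '%' are percent-encoded.
    return text.replace("%", "%25").replace("\r", "%0D").replace("\n", "%0A")


def _escape_property(text: str) -> str:
    # Property values additionally escape the ':' and ',' delimiters.
    return _escape_data(text).replace(":", "%3A").replace(",", "%2C")


def annotation(severity: str, title: str, message: str) -> str:
    """Render one finding as a GitHub Actions workflow command."""
    level = LEVEL_BY_SEVERITY.get(severity, "notice")
    return f"::{level} title={_escape_property(title)}::{_escape_data(message)}"


print(annotation("high", "SPOF", "Orchestrator is a single point of failure"))
```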

Using it from Python

from swarm_test import SwarmProbe

# Works with a CrewAI Crew, LangGraph CompiledGraph, or AutoGen GroupChatManager
probe = SwarmProbe(crew, swarm_name="my-crew")
report = probe.run_all()
report.print_summary()
report.to_html("report.html")

Installation

pip install swarm-test

or with framework extras:

pip install "swarm-test[crewai]"
pip install "swarm-test[langgraph]"
pip install "swarm-test[autogen]"
pip install "swarm-test[png]"  # for PNG graph export

How it works

swarm-test builds a NetworkX directed graph from your agent system — nodes are agents, edges are interactions extracted by each framework adapter. All tests are static graph analyses; no LLM calls are made, and results are deterministic given the same topology.
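As a rough illustration of what a static, deterministic graph analysis looks like, the sketch below models a topology as a plain adjacency dict and estimates the downstream impact of each agent failing. It uses only the standard library rather than NetworkX, and the impact metric (share of other agents reachable from the failed one) is an assumption for illustration, not the tool's actual scoring.

```python
from collections import deque


def downstream(graph: dict[str, list[str]], failed: str) -> set[str]:
    """Agents reachable from `failed` by following directed edges (BFS)."""
    seen: set[str] = set()
    queue = deque(graph.get(failed, ()))
    while queue:
        agent = queue.popleft()
        if agent in seen or agent == failed:
            continue
        seen.add(agent)
        queue.extend(graph.get(agent, ()))
    return seen


def cascade_impact(graph: dict[str, list[str]]) -> dict[str, float]:
    """Fraction of the other agents affected when each agent fails in turn."""
    others = len(graph) - 1
    return {
        agent: (len(downstream(graph, agent)) / others if others > 0 else 0.0)
        for agent in graph
    }


# Hypothetical five-agent pipeline used only for this example.
pipeline = {
    "Planner": ["Researcher", "Coder"],
    "Researcher": ["Writer"],
    "Writer": ["Reviewer"],
    "Coder": ["Reviewer"],
    "Reviewer": [],
}
impact = cascade_impact(pipeline)
for agent, share in sorted(impact.items(), key=lambda item: -item[1]):
    print(f"{agent:<10} takes down {share:.0%} of the other agents")
```

Because the input is only the topology, running this twice on the same graph always gives the same result, which is the property the paragraph above describes.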

Cascade failure — simulates each agent failing in turn and measures downstream impact.

Blast radius — detects articulation points (graph-theoretic SPOFs) and scores every agent on a 0–100 redundancy scale composed of path redundancy (30%), role uniqueness (25%), tool coverage (20%), betweenness centrality (15%), and degree ratio (10%).

Context leakage — scans interaction payloads against a sensitive-data regex set extensible from .swarmtest.yml.

Intent drift — flags agents whose observed behavior diverges from their declared role; includes prompt-injection heuristics.

Collusion — finds dense cliques, echo chambers, and cycles that bypass the declared orchestrator.

Timeout resilience — identifies long synchronous chains with no timeout boundary.

Trajectory analysis — flags self-loops, ping-pong pairs, multi-agent feedback cycles, unbounded loops with no exit, repeated parallel calls, and cycles deeper than max_trajectory_depth (default 5).

Contract violation — validates agent outputs against JSON schemas declared per edge (opt-in; pass --contracts contracts.yml).
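Two of the ideas above lend themselves to a short sketch: finding articulation points and combining the five weighted components into a 0–100 redundancy score. The weights are the ones quoted for blast radius; everything else is an assumption. In particular, the brute-force connectivity check stands in for a proper articulation-point algorithm, and each component is assumed to be pre-normalised to the range 0 to 1 with higher meaning more redundant.

```python
# Weights as listed in the blast-radius description above.
WEIGHTS = {
    "path_redundancy": 0.30,
    "role_uniqueness": 0.25,
    "tool_coverage": 0.20,
    "betweenness_centrality": 0.15,
    "degree_ratio": 0.10,
}


def redundancy_score(components: dict[str, float]) -> float:
    """Weighted sum of normalised components, scaled to 0-100."""
    return 100 * sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


def _undirected(graph: dict[str, list[str]]) -> dict[str, set[str]]:
    neighbours: dict[str, set[str]] = {agent: set() for agent in graph}
    for source, targets in graph.items():
        for target in targets:
            neighbours[source].add(target)
            neighbours.setdefault(target, set()).add(source)
    return neighbours


def _connected_without(neighbours: dict[str, set[str]], removed: str) -> bool:
    nodes = [n for n in neighbours if n != removed]
    if len(nodes) <= 1:
        return True
    seen, stack = {nodes[0]}, [nodes[0]]
    while stack:
        for nxt in neighbours[stack.pop()]:
            if nxt != removed and nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return len(seen) == len(nodes)


def single_points_of_failure(graph: dict[str, list[str]]) -> list[str]:
    """Agents whose removal disconnects an (initially connected) topology."""
    neighbours = _undirected(graph)
    return sorted(n for n in neighbours if not _connected_without(neighbours, n))


star = {"Orchestrator": ["Worker1", "Worker2"], "Worker1": [], "Worker2": []}
print(single_points_of_failure(star))
print(redundancy_score({name: 0.5 for name in WEIGHTS}))
```

The brute-force check is quadratic in the number of agents, which is acceptable for small swarms; a linear-time alternative is the classic DFS low-link method.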

Roles are classified from structural metrics (in/out degree, betweenness centrality) plus naming hints, each with a 0–100% confidence score. Severity is then role-adjusted: an orchestrator with high blast radius is expected and gets downgraded; a validator leaking context is a security incident and gets upgraded.
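The role adjustment can be pictured as a small lookup that shifts a finding one step up or down a severity ladder. This is a sketch under stated assumptions: the four-level ladder, the test identifiers, and the one-step shift are illustrative choices, and only the two role/test pairs come from the description above.

```python
# Assumed severity ladder, lowest to highest.
SEVERITIES = ["low", "medium", "high", "critical"]

# (role, test) -> number of steps to shift; the two pairs named in the text.
ADJUSTMENTS = {
    ("orchestrator", "blast_radius"): -1,    # expected hub, so downgrade
    ("validator", "context_leakage"): +1,    # security incident, so upgrade
}


def adjust_severity(severity: str, role: str, test: str) -> str:
    """Shift a finding's severity by role, clamped to the ends of the ladder."""
    shift = ADJUSTMENTS.get((role, test), 0)
    index = SEVERITIES.index(severity) + shift
    return SEVERITIES[max(0, min(index, len(SEVERITIES) - 1))]


print(adjust_severity("high", "orchestrator", "blast_radius"))
print(adjust_severity("high", "validator", "context_leakage"))
```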

Output modes & formats

| Flag | Output |
| --- | --- |
| --quiet / -q | Headline verdict only (one line). Ideal for if checks in CI scripts. |
| (default) | Headline + test results + critical/high findings + SPOFs. |
| --verbose / -V | Every finding, graph metrics, full health and redundancy tables. |

Output formats via --output-format: console, json, markdown, html. The same verbosity setting is configurable in .swarmtest.yml.

Graph export

swarm-test graph my_crew.py --format mermaid
swarm-test graph my_crew.py --format dot --output topology.dot
swarm-test graph my_crew.py --format png --output topology.png  # needs the [png] extra

Mermaid renders inline on GitHub, so you can drop the output straight into a README or PR description. Colors: red = SPOF, orange = moderate redundancy, green = fully redundant.
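For a sense of what such an export contains, the sketch below emits a Mermaid flowchart from an edge list and colours each node by redundancy status. The hex colours, status labels, and `to_mermaid` helper are assumptions for illustration; the output the tool actually produces may be laid out differently. Node names are assumed to be valid Mermaid identifiers (no spaces).

```python
# Assumed colour choices for the three statuses described above.
COLOURS = {"spof": "#e74c3c", "moderate": "#f39c12", "redundant": "#2ecc71"}


def to_mermaid(edges: list[tuple[str, str]], status: dict[str, str]) -> str:
    """Render a directed topology as Mermaid flowchart text."""
    lines = ["graph TD"]
    for source, target in edges:
        lines.append(f"    {source} --> {target}")
    for agent, level in status.items():
        lines.append(f"    style {agent} fill:{COLOURS[level]}")
    return "\n".join(lines)


diagram = to_mermaid(
    edges=[("Orchestrator", "Worker1"), ("Orchestrator", "Worker2")],
    status={"Orchestrator": "spof", "Worker1": "redundant", "Worker2": "redundant"},
)
print(diagram)
```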

Historical tracking

Every run writes a small JSON snapshot to .swarmtest-history/. Subsequent runs print a trend line below the headline verdict:

Swarm Score: 72/100 — NEEDS IMPROVEMENT (3 critical findings)
Trend: ↑ +18 from last run (was 54) — improving
Recent: 54 → 61 → 58 → 72
✓ 3 findings resolved since last run
⚠ 1 new finding since last run

Browse with swarm-test history show. Disable per-run with --no-history, or globally via history_enabled: false in .swarmtest.yml. .swarmtest-history/ is gitignored by default; commit it if you want the trend to survive across CI machines.
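The comparison between two snapshots comes down to a score delta and a set difference over finding identifiers. The sketch below assumes a snapshot is a dict with a `score` and a list of `findings` IDs; the real JSON layout in .swarmtest-history/ is not documented here and may differ.

```python
def compare_runs(previous: dict, current: dict) -> dict:
    """Diff two run snapshots (assumed shape: {'score': int, 'findings': [ids]})."""
    delta = current["score"] - previous["score"]
    if delta > 0:
        direction = "improving"
    elif delta < 0:
        direction = "declining"
    else:
        direction = "stable"
    before, after = set(previous["findings"]), set(current["findings"])
    return {
        "delta": delta,
        "direction": direction,
        "resolved": sorted(before - after),  # present last run, gone now
        "new": sorted(after - before),       # absent last run, present now
    }


# Hypothetical snapshots with made-up finding IDs.
last_run = {"score": 58, "findings": ["F1", "F2", "F3", "F4"]}
this_run = {"score": 72, "findings": ["F4", "F5"]}
summary = compare_runs(last_run, this_run)
print(summary)
```

Stable finding identifiers matter here: if IDs change between runs, every finding would show up as both resolved and new.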

Configuration (.swarmtest.yml)

collusion

sensitive_patterns:
  - "INTERNAL-[A-Z0-9]+"
output_format: html
output_path: ./swarm.html
timeout_seconds: 30
strict: false  # treat ANY finding as a failure
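To show what an entry under sensitive_patterns does in practice, here is a minimal sketch of a regex scan over interaction payloads. The built-in pattern shown (the well-known AWS access key ID shape) and the helper names are assumptions for illustration; swarm-test's actual default pattern set is not listed in this document.

```python
import re

# One illustrative built-in; the tool's real default set is not shown here.
BUILTIN_PATTERNS = {"aws_access_key": r"AKIA[0-9A-Z]{16}"}


def compile_patterns(extra: list[str]) -> dict[str, re.Pattern]:
    """Merge built-in patterns with user-supplied ones from the config."""
    named = dict(BUILTIN_PATTERNS)
    for index, pattern in enumerate(extra):
        named[f"custom_{index}"] = pattern
    return {name: re.compile(pattern) for name, pattern in named.items()}


def scan_payload(payload: str, patterns: dict[str, re.Pattern]) -> list[str]:
    """Names of every pattern that matches somewhere in the payload."""
    return sorted(name for name, regex in patterns.items() if regex.search(payload))


patterns = compile_patterns(["INTERNAL-[A-Z0-9]+"])
print(scan_payload("key=AKIAABCDEFGHIJKLMNOP ref INTERNAL-42A", patterns))
print(scan_payload("nothing sensitive here", patterns))
```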

Auto-discovers .swarmtest.yml, .swarmtest.yaml, swarmtest.yml, or a [tool.swarmtest] table in pyproject.toml. CLI flags always override config-file values. Exit codes from run: 0 (passed), 1 (findings exceed thresholds), 2 (config or runtime error).
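The three exit codes make it straightforward to wrap the CLI in a script. The sketch below maps each code to its documented meaning and shows one way to invoke the command from Python; `run_gate` is a hypothetical wrapper, it needs swarm-test installed and on PATH, and it is deliberately not called here.

```python
import subprocess

# Meanings as documented for `swarm-test run`.
EXIT_MEANINGS = {
    0: "passed",
    1: "findings exceed thresholds",
    2: "config or runtime error",
}


def describe_exit(code: int) -> str:
    """Human-readable meaning of a swarm-test exit code."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_gate(script: str) -> int:
    """Run `swarm-test run <script> --quiet` and return its exit code.

    Hypothetical wrapper; requires the swarm-test CLI to be installed.
    """
    completed = subprocess.run(["swarm-test", "run", script, "--quiet"], check=False)
    return completed.returncode


for code in (0, 1, 2):
    print(code, "->", describe_exit(code))
```

Distinguishing 1 from 2 is useful in CI: a gate can fail the build on findings while treating a configuration error as a pipeline problem to be fixed separately.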

Plugin system

Ship custom tests as installable Python packages. Register under the swarm_test.plugins entry-point group; swarm-test auto-discovers and runs them alongside the built-in tests:

[project.entry-points."swarm_test.plugins"]
my_custom_test = "my_package.plugins:MyPlugin"

swarm-test plugins list

See examples/plugin_template/ for a runnable starter.

Framework examples (CrewAI, LangGraph, AutoGen, static)

CrewAI

from crewai import Crew
from swarm_test import SwarmProbe

SwarmProbe(crew, swarm_name="my-crew").run_all().print_summary()

LangGraph

from langgraph.graph import StateGraph
from swarm_test import SwarmProbe

SwarmProbe(compiled_graph, swarm_name="my-langgraph").run_all().to_json("report.json")

AutoGen

from autogen import GroupChatManager
from swarm_test import SwarmProbe

SwarmProbe(manager, swarm_name="my-autogen").run_all().print_summary()

Static graph (no live framework)

from swarm_test import SwarmProbe, AgentNode, InteractionEvent, EventType

a = AgentNode(name="Fetcher", role="researcher")
b = AgentNode(name="Summarizer", role="writer")

SwarmProbe(
    swarm_name="my-swarm",
    agents=[a, b],
    events=[
        InteractionEvent(
            source_agent_id=a.id,
            target_agent_id=b.id,
            event_type=EventType.TASK_DELEGATE,
        )
    ],
).run_all().print_summary()

Links

PyPI: https://pypi.org/project/swarm-test/ — pip install swarm-test

Issues: https://github.com/surajkumar811/swarm-test/issues

License: MIT — free and open source

If swarm-test catches a real bug for you, please star the repo — it helps other teams find it.

About

Chaos engineering & reliability testing for multi-agent AI systems
