We Ran a Complex Task – A LangChain Repo Analysis with Claude Fable Models
A detailed experiment comparing five Claude models (Opus, Fable, Sonnet, Sonnet 4.6, Haiku) on a full audit of the LangChain Python monorepo. Fable matched Opus in grade (A-) but excelled in generating actionable milestones and quick wins. The article presents findings, strengths/weaknesses, and recommends a multi-model pipeline.
Engineering · Jul 2, 2026 · 11 min read
Anthropic just shipped Claude Fable. We wanted a real answer to a practical question:
If you run the same complex engineering task on Opus, Fable, Sonnet, and Haiku — what do you actually get back?
Not a benchmark score. Not a vibe check. A full principal-engineer audit of a production open-source monorepo — with evidence, severity labels, and an execution plan.
We ran that experiment inside CTRL NODE: one prompt, five agents, five models, one cloned repository.
- The goal: one hard task, five models
What we tested
We gave every model the same four-phase audit prompt and the same target: the LangChain Python monorepo (a large, mature library ecosystem — not a toy repo).
The prompt asks for:
Repository Map — explore first, judge second
Audit Report — architecture, security, tests, performance, deps, DX, docs (with file:line citations)
Improvement Strategy — themes, trade-offs, measurable “done” criteria
Task Plan — milestones M0–M3, quick wins, effort/risk/deps on each item
Every finding must be evidence-based. Guessing is explicitly forbidden.
That is a genuinely heavy task: thousands of files, real CI configs, security-sensitive deserialization paths, and god-class modules on hot code paths. It is the kind of work teams normally spread across several senior engineers.
Why Fable vs the rest
Fable is positioned as a strong reasoning model for long, structured work. We included it alongside:
| Model | Role in the experiment |
| --- | --- |
| Claude Opus 4.8 | Premium tier: threat modeling baseline |
| Claude Fable 5 | New tier: strategy & execution planning |
| Claude Sonnet 5 | Current Sonnet: primary audit pass |
| Claude Sonnet 4.6 | Previous Sonnet: ops / CI lens |
| Claude Haiku 4.5 | Fast tier: exploration & map |
The hypothesis was not “Fable wins everything.” It was: each tier sees different things, and Fable might be the best at turning findings into a shippable backlog.
The prompt
The full prompt lives in our catalog as langchain-prompt.md. Core instruction (abbreviated):
You are a world-class, principal-engineer-level software engineer and technical audit expert. Perform an in-depth analysis of this code repository, provide an honest audit report, and offer a prioritized, actionable improvement plan.
Follow four phases in order: Discovery → Audit → Strategy → Task Plan. All judgments must cite real file paths and line numbers. Do not guess.
Deliverables requested per run:
audit-report-.md — full Markdown report
audit-report-.html — interactive dark-theme dashboard (tabs: Overview, Map, Audit, Strategy, Tasks)
Summary of the prompt: resumen-langchain-prompt.md.
- How we set it up in CTRL NODE
We did not paste the prompt into five browser tabs. We ran it the way a team would: Bridge on a real machine, a project work directory pointing at the clone, one agent per model tier.
Prerequisites
Bridge (ctrlnode) installed and paired — see Bridge setup.
Claude SDK API key set in ~/.ctrlnode/.env (providers load automatically — no PROVIDERS flag needed):
ANTHROPIC_API_KEY=sk-ant-...
BASE_PATH=/home/you/workspace
LangChain cloned on the Bridge host under BASE_PATH (CTRL NODE does not git-clone for you; the work directory points at an existing folder).
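As a rough pre-flight check for the prerequisites above, the sketch below parses a KEY=VALUE file and reports what is missing. It is not part of Bridge: the function names and the clone directory name `langchain` are assumptions for illustration. Only `ANTHROPIC_API_KEY`, `BASE_PATH`, and the `~/.ctrlnode/.env` location come from the setup described here.

```python
from pathlib import Path


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def check_prerequisites(env: dict[str, str], repo_dir_name: str) -> list[str]:
    """Return human-readable problems; an empty list means the host looks ready."""
    problems: list[str] = []
    if not env.get("ANTHROPIC_API_KEY"):
        problems.append("ANTHROPIC_API_KEY is missing or empty")
    base = env.get("BASE_PATH")
    if not base:
        problems.append("BASE_PATH is missing or empty")
    elif not (Path(base) / repo_dir_name / ".git").exists():
        # Assumption: the clone lives directly under BASE_PATH.
        problems.append(f"no git clone found at {Path(base) / repo_dir_name}")
    return problems
```

On the Bridge host you might call `check_prerequisites(parse_env((Path.home() / ".ctrlnode" / ".env").read_text()), "langchain")` and fix anything it lists before creating the project.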
Project
In the web app: + NEW PROJECT
| Field | Value |
| --- | --- |
| NAME | langchain-audit-experiment |
| AGENT TYPE | Claude |
| WORK DIRECTORY | Browse → select the LangChain clone → USE THIS DIRECTORY |
| DESCRIPTION | Five-model audit benchmark |
The work directory is what lets agents read the full tree in WORK DIRECTORY task mode — the same scope a staff engineer would need.
Agents (one per model)
Team → + ADD AGENT — we created five agents on the same project:
| Agent name | MODEL field | Purpose |
| --- | --- | --- |
| audit-opus | claude-opus-4-8 | Threat & design review |
| audit-fable | claude-fable-5 | Strategy & task plan |
| audit-sonnet-5 | claude-sonnet-5 | Primary audit |
| audit-sonnet-46 | claude-sonnet-4-6 | CI / ops pass |
| audit-haiku | claude-haiku-4-5 | Fast map |
Models are selected in the MODEL combobox (synced from Bridge when online) or typed manually. Fable appears as claude-fable-5 in the Bridge model manifest (v2026.2.4+).
Optional AGENT SYSTEM INSTRUCTIONS were left minimal — we wanted the task prompt to carry the spec, not per-agent persona drift.
- How we ran the prompt
For each agent, same procedure:
+ NEW TASK on the project
TITLE: LangChain principal audit —
INSTRUCTIONS: paste full contents of langchain-prompt.md
ASSIGN TO AGENT: pick the matching agent chip
OUTPUT MODE: WORK DIRECTORY (full repo scope; optional focus paths left empty)
NEW TASK → task lands in Backlog
RUN → dispatches to Bridge → agent moves to In progress
Bridge delivers the task with repositoryPaths and repo dispatch context so the Claude SDK runs against the LangChain tree on disk. Outputs (audit-report-*.md / .html) were collected from the agent’s work directory and copied into our marketing catalog folder.
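Collecting the outputs is easy to script. This is a minimal sketch, assuming the reports sit at the top level of the agent's work directory and follow the audit-report-*.md / .html naming. `collect_reports` and the destination folder are hypothetical names, not CTRL NODE features.

```python
import shutil
from pathlib import Path


def collect_reports(work_dir: Path, dest_dir: Path) -> list[Path]:
    """Copy audit-report-*.md and audit-report-*.html from work_dir into dest_dir."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    copied: list[Path] = []
    for pattern in ("audit-report-*.md", "audit-report-*.html"):
        # sorted() keeps the copy order stable between runs
        for source in sorted(work_dir.glob(pattern)):
            target = dest_dir / source.name
            shutil.copy2(source, target)  # copy2 also preserves timestamps
            copied.append(target)
    return copied
```

Because all five agents share one work directory, copying after each run (rather than once at the end) avoids one model's report being overwritten if two runs happen to use the same file name.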
Tip for reproducibility: use the same commit SHA for every run. Our reports reference LangChain master at 2b47357 where noted.
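One way to enforce that tip before each run is to compare the clone's HEAD with the expected short SHA. The sketch below shells out to the standard `git rev-parse HEAD` command; the helper names are ours, and the repository path in the usage note is only an example.

```python
import subprocess
from pathlib import Path


def head_sha(repo: Path) -> str:
    """Return the full commit SHA that HEAD points at in the given clone."""
    result = subprocess.run(
        ["git", "-C", str(repo), "rev-parse", "HEAD"],
        capture_output=True,
        text=True,
        check=True,  # raise if git fails, e.g. the path is not a repository
    )
    return result.stdout.strip()


def is_pinned(full_sha: str, expected_prefix: str) -> bool:
    """True when the checked-out commit starts with the expected short SHA."""
    if not expected_prefix:
        return False
    return full_sha.lower().startswith(expected_prefix.lower())
```

For example, `is_pinned(head_sha(Path("/home/you/workspace/langchain")), "2b47357")` should be true before you dispatch any of the five tasks.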
- What Fable returned
Fable graded the repo A−, the same grade Opus assigned and better calibrated than the A that Haiku awarded.
Executive summary (Fable)
Top 3 risks
Complexity concentration — five files exceed 1,800 lines; runnables/base.py is 6,574 LOC. High blast radius on every invoke/stream path.
Unsafe-by-default deserialization — langchain_core.load defaults to allowed_objects='core', documented as unsafe for untrusted manifests. Safe options exist but are opt-in.
Type-safety escape hatches — 208 type: ignore comments in langchain-core alone; disallow_any_generics=false weakens the public API contract.
Top 3 opportunities
Flip deserialization default to a safe allowlist ('messages') on the next major version.
Burn down parked lint TODOs (BLE, ANN401, ERA) — enforcement infra already exists.
Decompose the top god files behind unchanged public façades (zero API break).
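The complexity-concentration risk is easy to spot-check on your own checkout. The sketch below counts raw lines per Python file, which may not match the LOC metric Fable used, so treat the numbers as approximate. The function name and the 1,800-line default (taken from the threshold quoted above) are our choices.

```python
from pathlib import Path


def large_python_files(root: Path, threshold: int = 1800) -> list[tuple[int, str]]:
    """Return (line_count, relative_path) for .py files above threshold, largest first."""
    hits: list[tuple[int, str]] = []
    for path in root.rglob("*.py"):
        if not path.is_file():
            continue
        # Read in binary mode so odd encodings cannot break the count.
        with path.open("rb") as handle:
            lines = sum(1 for _ in handle)
        if lines > threshold:
            hits.append((lines, path.relative_to(root).as_posix()))
    return sorted(hits, reverse=True)
```

Pointing it at a single package directory rather than the whole monorepo keeps vendored and generated files out of the list.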
What stood out
Fable’s differentiator was not a hotter take on security headlines. It was Phase 3 and Phase 4:
Four strategic themes (complexity, switched-off guardrails, safe-by-default trust boundaries, workspace hygiene)
Explicit non-goals (e.g. don’t rewrite vendored mustache.py this cycle — add property tests instead)
Milestones M0–M3 with workload badges (S/M/L/XL), risk, dependencies, and acceptance criteria
Quick wins you could ship in an afternoon (.gitignore for audit artifacts, logger.debug on swallowed AttributeError in callbacks/usage.py, CI ratchet on type: ignore count)
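The CI ratchet on the type: ignore count is the easiest of these quick wins to picture. The following is a sketch of the general ratchet idea, not Fable's proposed implementation: it counts matches with a plain regex (so an occurrence inside a string literal would also be counted) and fails only when the count grows past a stored baseline. All names are illustrative.

```python
import re
from pathlib import Path

# Matches "# type: ignore", "#type:ignore", and bracketed forms like "# type: ignore[arg-type]".
TYPE_IGNORE = re.compile(r"#\s*type:\s*ignore")


def count_type_ignores(root: Path) -> int:
    """Count type: ignore comments across all .py files under root."""
    total = 0
    for path in root.rglob("*.py"):
        if path.is_file():
            text = path.read_text(encoding="utf-8", errors="replace")
            total += len(TYPE_IGNORE.findall(text))
    return total


def ratchet(current: int, baseline: int) -> tuple[bool, str]:
    """Pass when the count has not grown; nudge the baseline down when it shrank."""
    if current > baseline:
        return False, f"type: ignore count rose from {baseline} to {current}"
    if current < baseline:
        return True, f"count fell to {current}; lower the baseline to lock in the gain"
    return True, f"count unchanged at {current}"
```

In CI you would call `ratchet(count_type_ignores(Path("<package dir>")), baseline)` and exit non-zero when the first element is false. The baseline would start at whatever the count is on the day you add the check, then only ever move down.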
Near-exclusive Fable findings:
Vendored 704-line Mustache engine (mustache.py) with its own security surface
McCabe C90 complexity lint explicitly disabled — no automated backpressure on god-file growth
Thin test breadth vs complexity for langchain_v1/agents/factory.py (56 test files vs 1,891-line factory)
What Fable did not emphasize
Fable did not surface several issues other models caught:
TOCTOU / DNS rebinding on SSRF paths (Opus)
ShellToolMiddleware host execution by default (Opus)
SSRF transport adopted in only two call sites + unprotected graph_mermaid.py fetch (Sonnet 5)
Commented lockfile check in CI _lint.yml (Sonnet 4.6)
Broken README model example / missing SECURITY.md (Sonnet 4.6)
That gap is the point: Fable is not a replacement for a multi-model pipeline.
Full report: audit-report-fable.md · Interactive dashboard: audit-report-fable.html
- How the five models compare
| Model | Grade | Best at | Weak at |
| --- | --- | --- | --- |
| Opus 4.8 | A− | Threat modeling (TOCTOU, agent shell defaults, env bypass) | CI lockfile, default load(), README gaps |
| Fable 5 | A− | Strategy, milestones, quick wins, engineering debt | Agent-specific threats, SSRF adoption map |
| Sonnet 5 | B+ | SSRF infra vs adoption, silent except, repo hygiene | Lockfile CI, README, SECURITY.md |
| Sonnet 4.6 | B+ | Ops: lockfile CI, load() default, onboarding docs | Newer SSRF adoption analysis |
| Haiku 4.5 | A* | Fast LOC map, callback cycles, duplicate translators | *Inflated grade; factual CI error on lockfile |
*Haiku’s A looks confident on paper. Cross-checking against Sonnet 4.6 showed a wrong claim about lockfile validation in CI.
Exclusive findings matrix (selected)
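| Finding | Op | Fb | S5 | S4.6 | Hk |
| --- | --- | --- | --- | --- | --- |
| TOCTOU / DNS rebinding | ✓ | - | - | - | - |
| Shell host by default | ✓ | - | - | - | - |
| SSRF transport ~2 call sites | - | - | ✓ | - | - |
| graph_mermaid.py no SSRF | - | - | ✓ | - | - |
| Default load() unsafe | - | ✓ | - | ✓ | - |
| Plan M0–M3 + non-goals | - | ✓ | - | - | - |
| mustache.py / C90 off | - | ✓ | - | - | - |
| Lockfile CI commented | - | - | - | ✓ | ✗ wrong |
| Callback/tracer cycles | - | - | - | - | ✓ |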
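The matrix above is small enough to encode directly, which makes the blind-spot argument checkable. The sketch below transcribes only the selected rows shown here and treats Haiku's incorrect lockfile claim as "not found". The model labels and function names are ours.

```python
# Which models reported each selected finding (transcribed from the matrix).
FINDINGS: dict[str, set[str]] = {
    "TOCTOU / DNS rebinding": {"Opus"},
    "Shell host by default": {"Opus"},
    "SSRF transport ~2 call sites": {"Sonnet 5"},
    "graph_mermaid.py no SSRF": {"Sonnet 5"},
    "Default load() unsafe": {"Fable", "Sonnet 4.6"},
    "Plan M0-M3 + non-goals": {"Fable"},
    "mustache.py / C90 off": {"Fable"},
    "Lockfile CI commented": {"Sonnet 4.6"},  # Haiku's claim was wrong, so it is excluded
    "Callback/tracer cycles": {"Haiku"},
}


def exclusive_by_model(findings: dict[str, set[str]]) -> dict[str, list[str]]:
    """Group findings that exactly one model reported, keyed by that model."""
    result: dict[str, list[str]] = {}
    for name, models in findings.items():
        if len(models) == 1:
            (model,) = models
            result.setdefault(model, []).append(name)
    return result


def missed_by(findings: dict[str, set[str]], models_run: set[str]) -> list[str]:
    """List findings that none of the models in models_run reported."""
    return [name for name, models in findings.items() if not models & models_run]
```

On these nine rows, `missed_by(FINDINGS, {"Opus"})` returns seven findings and `missed_by(FINDINGS, {"Fable"})` returns six, while running all five models leaves none uncovered.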
The pipeline we’d actually use
Haiku → fast map & architecture hotspots
Sonnet 5 → primary audit + security adoption gaps
Sonnet 4.6 → CI, docs, onboarding landmines
Opus → threat review for agent-facing surfaces
Fable → merge into one prioritized backlog
Human → verify _lint.yml, load.py, README in your checkout
No single model replaces this chain. Paying only for Opus — or only for Fable — leaves blind spots.
Deep dive: comparison-models-report.md
Slide deck for the story
We also built a 14-slide presenter deck for video walkthroughs: model-comparison-presentation.html (←/→ navigate, F fullscreen).
- What this means for CTRL NODE users
Model choice is a workflow decision, not a vanity tier pick. Use Haiku to scout, Sonnet to audit, Opus for threats, Fable to plan — on the same project and work directory.
WORK DIRECTORY mode matters for tasks like this. An output-only sandbox would not have produced file:line citations across CI, core, and partner packages.
Fable earns a slot after discovery, not instead of Sonnet or Opus. Its A− grade matched Opus; its deliverable shape (milestones, ratchets, non-goals) was the most actionable.
Re-run the experiment on your repo — clone under Bridge BASE_PATH, point a Claude project at it, duplicate the task five times with different MODEL values.
- References — all artifacts
The full experiment — every prompt, per-model report, and the comparison deck — is published below as supporting material for this article.
Prompt
| File | Description |
| --- | --- |
| langchain-prompt.md | Full four-phase audit prompt (English) |
| resumen-langchain-prompt.md | Prompt summary (Spanish) |
Per-model reports
| Model | Markdown | HTML dashboard |
| --- | --- | --- |
| Claude Fable 5 | audit-report-fable.md | audit-report-fable.html |
| Claude Opus 4.8 | audit-report-opus.md | audit-report-opus.html |
| Claude Sonnet 5 | audit-report-sonnet-5.md | audit-report-sonnet-5.html |
| Claude Sonnet 4.6 | audit-report-sonnet-4-6.md | audit-report-sonnet-4-6.html |
| Claude Haiku 4.5 | audit-report-haiku.md | audit-report-haiku.html |
The prompt asks every model for paired .md + .html outputs. Every model in this batch produced both formats.
Comparison & media
| File | Description |
| --- | --- |
| comparison-models-report.md | Full five-model written comparison |
| model-comparison-presentation.html | Animated 14-slide deck (Op · Fb · S5 · S4.6 · Hk) |
Try it yourself
Start free — create a Claude project and pair Bridge.
Clone the repo you care about on the Bridge machine; set WORK DIRECTORY.
Register agents with different MODEL values (claude-fable-5, claude-opus-4-8, …).
Paste the audit prompt into INSTRUCTIONS, assign, RUN, compare outputs.
Questions, or want us to run this on your stack? Get in touch with the CTRL NODE team.
Experiment date: 17 June 2026 · CTRL NODE — orchestrate Claude, Copilot, Gemini, Cursor, and more from one control plane.