2026-06-25 12:05 UTC | In-site rewrite | 6 min read | Updated: 2026-06-25 12:12 UTC

Beyond Fable: Can a Local LLM Replace Cloud AI for Security Code Reviews?

Research shows that with proper scaffolding, a local LLM like Qwen3.6-35B-A3B can produce security findings comparable to frontier cloud models, but works best in a Source-local pipeline where the cloud designs and consolidates while the local model executes, keeping source code on-premises.

Source: Hacker News AI | Author: dubbel

Key points

A local LLM (Qwen3.6-35B-A3B) found comparable vulnerability sets to cloud frontier models in under 90 minutes with zero human nudges.
The best practice is a Source-local pipeline: cloud for prompt engineering and consolidation, local for execution.
Prompt engineering matters more than model size; splitting tasks into focused prompts dramatically improves findings.
Any capable model can serve as the orchestrator, including cyber-restricted models like Fable.

Software Assurance | Code Audit | Secure Development | AI

2026-06-22 • 20 minute read

Karsten Nohl @karsten-nohl

Chief Innovation Officer Allurity

The Problem

Security code review is one of the most valuable, and traditionally labor-intensive, services in cybersecurity. LLMs have become tireless wingmen in this process: they scan thousands of lines of code, cross-reference CWE databases, and surface patterns that even experienced reviewers might miss. But there's a catch.

Many pentest recipients do not want their source code shared with cloud-hosted services — particularly in finance, government, and critical infrastructure. Sending proprietary code to a third-party LLM creates confidentiality and data residency risks that contractual safeguards with the LLM provider alone cannot fully mitigate.

The resulting dilemma: the best LLMs are cloud-hosted, so the companies that need security reviews the most often forgo these leading capabilities.

How big is the lead of cloud-hosted models really? We set out to answer a practical question: can a locally-hosted open-weight model produce security findings comparable to frontier cloud models?

Conclusion

We find the answer is: almost — but only with the right scaffolding.

We ran a series of experiments testing the limits of local LLMs and found that they work best in tandem with cloud-based frontier models, but without disclosing source code to the cloud:

A Qwen3.6-35B-A3B model with only ~3B active parameters, running entirely on a Mac laptop with no source code leaving the machine, produced finding sets comparable in size to frontier cloud models (GLM-5, Claude Opus 4.6) on both a fintech app and a voting app, with some unique findings of its own. It required zero human nudges and completed each codebase in under 90 minutes. For the central task — reading code, discovering vulnerabilities, classifying severity, triaging CVE output — a local model is now in the same league as frontier models.

A caveat: Finding count parity is not capability parity. The claim is that a local model is competitive enough to be useful as part of the pipeline, and that its findings are perceived as equally impactful by experts. This study focuses on the quantitative side, but finding quality was validated by both pentest experts and a developer team.

What a local model does not yet do as well is design the review and consolidate the results. The most effective pipeline we found delegates both of these orchestration tasks to a cloud frontier model — but in neither stage does the cloud see source code. We call this Source-local: the proprietary source code never leaves the machine. Metadata does cross to the cloud (file tree, schema, routes, dependency manifests, and the generated step prompts), which can carry internal names, directory structure, and architecture. "No source leaves the building" is the accurate promise — "nothing leaves" is not.

The scaffolding that makes this work has three parts:

1. Structured decomposition and prompt generation: a cloud model breaks the review into focused steps and creates step prompts from metadata only (file tree, schema, routes; no source code).

2. Local tool and LLM output: the prompts execute locally, run standard security tools (e.g., bundler-audit, npm audit, Semgrep, Brakeman), and feed their JSON output to the local model for contextual triage and additional bug hunting.

3. Report consolidation: a final cloud pass merges the step-level findings into a delivery-ready report.

Parts 1 and 3 require no source code exposure to the cloud; Part 2 runs entirely locally.
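
To make the metadata-only boundary concrete, here is a minimal sketch of how a file tree could be collected for the cloud orchestrator without ever opening a source file. The function name `collect_metadata`, the manifest list, and the skipped directories are assumptions for illustration, not the tooling used in the study. The sketch records only manifest paths; the article's metadata also covers schema and routes, which are omitted here.

```python
from pathlib import Path

# Hypothetical manifest names and skip list; adjust to the stack under review.
MANIFEST_NAMES = {"package.json", "Gemfile.lock", "requirements.txt"}
SKIP_DIRS = {".git", "node_modules"}


def collect_metadata(root: str) -> dict:
    """Collect relative paths and sizes only; file contents are never read."""
    root_path = Path(root)
    tree, manifests = [], []
    for path in sorted(root_path.rglob("*")):
        rel = path.relative_to(root_path)
        if any(part in SKIP_DIRS for part in rel.parts):
            continue  # vendored or VCS directories add noise, not structure
        if path.is_file():
            # stat() reads filesystem metadata, not the file body
            tree.append({"path": rel.as_posix(), "bytes": path.stat().st_size})
            if path.name in MANIFEST_NAMES:
                manifests.append(rel.as_posix())
    return {"file_tree": tree, "manifests": manifests}
```

Note that even this output carries internal names and directory structure, which is exactly the residual exposure the article describes.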

The resulting best practice is: cloud for prompt engineering, local for execution, cloud for consolidation. The cloud model never sees source code — it designs the review. The local model never needs broad architectural reasoning — it executes focused checks against bundled files.

Figure 1. The Source-local pipeline. The cloud orchestrator designs the review (stage 1) and consolidates findings into a report (stage 3) from metadata only; the local model reads the source and runs the security tools (stage 2). Only step prompts and step-level findings cross the trust boundary — the source code never leaves the machine.

Leveraging Fable 5: the cloud-based orchestration layer is model-agnostic. The orchestrator in stages 1 and 3 need not be an unrestricted frontier model; a model with cybersecurity guardrails handles the job just fine. Claude Fable 5, which ships with deliberate cyber restrictions, designs the review prompts and consolidates the findings with no refusal and no loss of quality, fully matching Claude Opus 4.8 in those roles. This is unsurprising: designing and consolidating a defensive review is knowledge-and-structure work, not exploitation, and the orchestrator never touches source code.

The choice of both orchestrator and executor model, however, changes what gets found — the prompt design the orchestrator produces steers the local executor model toward materially different vulnerabilities, so the union of two orchestrators' prompts beats either alone. “No single model finds everything” holds true on the prompt-design and prompt-execution layers.

Key Takeaways

No Single Model Finds Everything

The union of all models' findings is significantly larger than any individual model's output. Each model found qualitatively different classes of vulnerabilities:

Claude excelled at architectural reasoning

GLM-5 at data flow tracing and tool integration

Gemma4 at line-level code pattern matching within focused file sets

Qwen3.6 at breadth coverage with aggressive severity calibration.

Implication for practitioners: Running a "second opinion" model genuinely expands coverage, even when using a much smaller model. This held across both codebases and all models tested.

Figure 2. Distinct vulnerability categories caught by each model across both codebases (53 total, from the cross-model coverage matrices). The union dwarfs any single model: Qwen has the most unique categories, GLM-5 and Qwen share the largest overlap, and only two categories were caught by all four. (Counts are categories, not validated true positives; Gemma4 ran on Fintech only.)
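
The coverage comparison behind a figure like this comes down to set arithmetic. The following is a minimal sketch, assuming each model's findings have already been reduced to a set of category labels. The function name and the data in the usage note are hypothetical and are not the study's results.

```python
def coverage_summary(by_model: dict[str, set[str]]) -> dict:
    """Compute the union, the categories every model caught, and per-model uniques."""
    union = set().union(*by_model.values())
    common = set.intersection(*by_model.values()) if by_model else set()
    unique = {}
    for model, categories in by_model.items():
        # Everything the other models caught, to subtract from this model's set
        others = set().union(*(c for m, c in by_model.items() if m != model))
        unique[model] = categories - others
    return {"union": union, "common": common, "unique": unique}
```

For example, with placeholder input `{"model_a": {"sqli", "idor"}, "model_b": {"sqli", "ssrf"}}`, the union has three categories, `sqli` is the only common one, and each model has one category the other missed.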

Prompt Engineering Matters More Than Model Size

A well-structured process makes every model better. So much better, in fact, that the differences in model capability become secondary to this ‘harness’.

For example, Gemma4, which runs on a ~3.8B active-parameter budget (it is a Mixture-of-Experts model, despite the "26B" total), produced three genuine findings that far larger frontier models missed. It is cheap to run yet competitive, and what made the difference was prompt design rather than raw capability. This takes preparation: when we skipped the ‘harness’ preparation and gave Gemma4 a monolithic prompt, it produced incomplete results and lost track of output instructions. When the same scope was decomposed into six focused micro-tasks with explicit file paths and grep commands, Gemma4 produced actionable findings with specific line numbers and code evidence. It did not hallucinate in either setup.

This suggests that the quality ceiling for local models is higher than expected, but reaching it requires harness preparation to guide the search. We find that this preparation effort can itself be automated: Claude generated step prompts from a file tree alone (no source code), and Qwen executing those auto-generated prompts produced more findings than either cloud model's single-prompt reviews. This bears repeating: when Claude prepared the prompts for a smaller model to run, that smaller model found more than a review in which Claude handled the entire test itself.
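
A minimal sketch of what such a focused micro-task prompt could look like when rendered from a step description. The `Step` fields, the prompt wording, and the file path in the usage note are assumptions for illustration; the article does not publish its actual prompt templates.

```python
from dataclasses import dataclass


@dataclass
class Step:
    name: str
    focus: str                # the single vulnerability class this step checks
    files: list[str]          # explicit paths, taken from the file tree
    grep_patterns: list[str]  # starting points for the local model


def render_step_prompt(step: Step) -> str:
    """Render one narrow prompt instead of one monolithic review request."""
    targets = " ".join(step.files)
    lines = [
        f"Step: {step.name}",
        f"Focus only on: {step.focus}",
        "Files to read:",
        *[f"  - {path}" for path in step.files],
        "Start by running:",
        *[f"  grep -n '{pattern}' {targets}" for pattern in step.grep_patterns],
        "Report each finding with file, line number, CWE ID, and code evidence.",
    ]
    return "\n".join(lines)
```

Calling `render_step_prompt(Step("authz-check", "missing authorization", ["app/api/transfer/route.ts"], ["getSession"]))` yields a prompt that names one concern, one file, and one grep command. That narrow scope is the property the decomposition relies on.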

Figure 3. Running the same playbook as one prompt vs split into multiple prompts (Fintech). Claude (22→43, +95%) and GLM-5 (28→54, +93%) nearly double; the small Gemma4 model gains less (12→17, +42%), revealing a lower ceiling. Qwen is omitted as it has no one-shot baseline.
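
The percentages in the caption are relative gains over the single-prompt baseline. A short sketch that recomputes them from the finding counts quoted above (the function name is our own):

```python
def relative_gain_percent(single_prompt: int, split_prompts: int) -> int:
    """Relative increase in findings, rounded to the nearest whole percent."""
    return round(100 * (split_prompts - single_prompt) / single_prompt)


# Finding counts as quoted in the Figure 3 caption (Fintech codebase)
FIGURE_3_COUNTS = {"Claude": (22, 43), "GLM-5": (28, 54), "Gemma4": (12, 17)}

if __name__ == "__main__":
    for model, (single, split) in FIGURE_3_COUNTS.items():
        print(f"{model}: {single} -> {split} (+{relative_gain_percent(single, split)}%)")
```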

Adjacent work points in the same direction. Niels Provos, in Finding Zero-Days with Any Model, argues that "vulnerability discovery is an orchestration problem, not a frontier-model problem," demonstrating an FSM-driven harness that surfaces real flaws across models. To be precise about his results (they are easy to over-read): his headline replication of the 27-year-old OpenBSD TCP SACK bug used commercial Claude — Sonnet 4.6 escalating to Opus 4.6 — and was validated with fuzzing and QEMU proof-of-concepts, while the open-weight GLM 5.1 was exercised on a different target. The domain of Provos’ study (deep C zero-day hunting with executable PoCs) differs from ours (web-application review with CWE-mapped findings, no PoC). Both reach the same conclusion.

Report Quality Varies Dramatically

The report quality is clearly better for larger and frontier models, once again suggesting that they have a place even for “Source-local” reviews where local models do the actual testing:

Claude Opus's report was the most polished for immediate delivery but required the most human nudges (~6 reminders across the writing process)

GLM-5 produced the most comprehensive deliverable set, but occasional hallucinated output references tarnish the report quality

Qwen produced well-structured per-step reports with correct CWE mappings and no hallucinated evidence. The step-level output was successfully consolidated by Claude into delivery-ready reports (the Source-local consolidation stage)

Gemma4's output required the most post-processing

The Review Orchestrator Can Be Any Capable Model — Even Cyber-Restricted Fable

Claude Fable 5, released in June 2026, ships with strong cybersecurity guardrails. Anthropic frames these as safety measures hardened through extensive red-teaming — and the subsequent US export-control suspension of Fable cuts against reading them as pure marketing. In practice Fable declines offensive/exploitation requests but readily helps with the preparation and analysis steps of a review — exactly the stages where a Source-local review needs a capable frontier model.

We compare two orchestrators — Claude Fable 5 vs Claude Opus 4.8 — for the Qwen-executed tests of two codebases. Two results matter for practitioners.

A cyber-restricted frontier model is a competent orchestrator: Fable 5 produced complete prompt packs and rigorous consolidations with no refusal and no obvious quality gap versus Opus 4.8 in those roles. (Fable can route certain cyber requests to Opus 4.8 internally; we watched for this and saw no such handoff during these defensive orchestration runs — so this is Fable itself, not a silent fallback.)

The choice of orchestrator changes what gets found: Fable's targeted prompts reliably recovered the known-hard "sentinel" bugs but produced a tighter set, while Opus's broader prompts surfaced a larger and in places more severe set — including criticals the Fable arm never raised (negative-vote zero-cost voting, unauthenticated ballot stuffing) — at the cost of missing sentinels it did not specifically target. As with executors, the union dominates either orchestrator alone: Fable is not "better" than Opus — they are best run in parallel.

Implication for practitioners: Orchestration should be viewed as a largely model-agnostic part of the pipeline: You can use whichever capable cloud model you have access to (restricted or not). And running two prompt designs for the local model expands coverage just as a second executor does. Mix and match for best results.
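
Running two orchestrator arms means the step-level findings must be merged before consolidation. Here is a minimal sketch, assuming each finding is a dict with `cwe`, `file`, and `severity` keys and that two findings with the same CWE in the same file are duplicates. Both the record shape and the dedup rule are our assumptions, not the study's method.

```python
SEVERITY_RANK = {"low": 0, "medium": 1, "high": 2, "critical": 3}


def merge_findings(arms: dict[str, list[dict]]) -> list[dict]:
    """Union findings across arms, keeping the highest severity and the provenance."""
    merged: dict[tuple[str, str], dict] = {}
    for arm, findings in arms.items():
        for finding in findings:
            key = (finding["cwe"], finding["file"])  # assumed duplicate key
            entry = merged.setdefault(key, {
                "cwe": finding["cwe"],
                "file": finding["file"],
                "severity": finding["severity"],
                "found_by": [],
            })
            if SEVERITY_RANK[finding["severity"]] > SEVERITY_RANK[entry["severity"]]:
                entry["severity"] = finding["severity"]
            if arm not in entry["found_by"]:
                entry["found_by"].append(arm)
    # Most severe first; ties keep insertion order because sorted() is stable
    return sorted(merged.values(), key=lambda e: -SEVERITY_RANK[e["severity"]])
```

The `found_by` list makes it visible which findings only one arm raised, which is where a second prompt design adds coverage.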

Figure 4. Sentinel-bug recovery by orchestrator across two runs, checked against the live source. Fable's targeted prompts re-find every real sentinel but invent a false positive in run 2 (a

Experiment Setup

Target Applications

We use two production codebases with different tech stacks and threat profiles:

Fintech Dashboard — a Next.js / TypeScript / React

[truncated for AI cost control]