
Devin Security Swarm

Devin releases Security Swarm, an automated security analysis tool powered by a new Agentic MapReduce architecture. It simulates a team of security researchers, mapping attack surfaces, parallelizing investigations, and verifying vulnerabilities. In a rigorous evaluation against real, recent vulnerabilities, it achieves 72% recall at approximately two-thirds the cost of the next best alternative.

Source: Hacker News AI · Author: meco

Today we’re releasing Security Swarm: an orchestration of Devins that analyzes a real codebase the way a team of security researchers would, by mapping the attack surface, fanning out to investigate in parallel, and triaging what they find into a ranked list of verified vulnerabilities. It’s powered by a new architecture we’re calling Agentic MapReduce.

A planner agent studies the repo and writes selectors: deterministic relevance tests for this codebase, such as its routes, auth boundaries, and deserialization sinks. The selectors run over every file with no model in the loop, so files that match nothing are dropped before any agent looks at them, and coverage is guaranteed by construction. Matches are batched and handed to child agents that investigate in parallel, each reasoning over one bounded shard from a focused context. A reducer then dedupes the results and reasons across shards to assemble what no single worker could see. For example, an unauthenticated ID leak plus an ID-gated RCE become one P0 RCE. Each serious finding is then reproduced in a sandbox against a running build, so the report reflects runtime-verified findings.
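The flow described above can be sketched in a few lines. This is a minimal illustration, not the product's implementation: the `SELECTORS` regexes, the `investigate` stub, and the shard size are all assumptions, and the reducer here only dedupes, whereas the reducer described above also reasons across shards.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

# Hypothetical selectors. In the described system a planner agent writes these
# per repo; here they are hard-coded regexes purely for illustration.
SELECTORS = {
    "route": re.compile(r"@app\.route\("),
    "deserialization": re.compile(r"pickle\.loads\("),
    "dangerous_call": re.compile(r"\beval\("),
}


@dataclass(frozen=True)
class Finding:
    path: str
    selector: str


def select(files: dict[str, str]) -> list[tuple[str, str]]:
    """Deterministic filter with no model in the loop; unmatched files are dropped."""
    hits = []
    for path, text in sorted(files.items()):
        for name, pattern in SELECTORS.items():
            if pattern.search(text):
                hits.append((path, name))
    return hits


def batch(items: list, size: int) -> list[list]:
    """Split matches into bounded shards."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def investigate(shard: list[tuple[str, str]]) -> list[Finding]:
    """Stand-in for a child agent: it simply reports each match as a finding."""
    return [Finding(path, name) for path, name in shard]


def reduce_findings(per_shard: list[list[Finding]]) -> list[Finding]:
    """Dedupe across shards, preserving first-seen order."""
    seen, merged = set(), []
    for findings in per_shard:
        for finding in findings:
            if finding not in seen:
                seen.add(finding)
                merged.append(finding)
    return merged


def run(files: dict[str, str], shard_size: int = 2) -> list[Finding]:
    shards = batch(select(files), shard_size)
    with ThreadPoolExecutor() as pool:  # map step: shards are processed in parallel
        per_shard = list(pool.map(investigate, shards))
    return reduce_findings(per_shard)
```

Because `select` is a pure function of the file contents, rerunning it on the same repo yields the same shards, which is one way to read "coverage is guaranteed by construction."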

[Figure: step 1, Plan (agentic). For this scan, the agent reads the repo and writes selectors tuned to it: route declarations, auth boundaries, deserialization sinks, and dangerous API calls.]

We ran Security Swarm against an eval of real and recent vulnerabilities, which the models had never seen. Security Swarm finds the bug in 72% of cases, the highest recall of any tool we tested, at roughly two-thirds the cost of the next most accurate alternative.

One of the core challenges of building Security Swarm was making sure it actually works. Off-the-shelf security benchmarks use synthetic bugs that look nothing like the code real software ships with, and vendor benchmarks quote recall numbers we can’t audit for false positives or reproduce independently. This matters even more for an AI-powered security product. We want to know whether Devin reasoned about the code to detect a real vulnerability, not whether a model recognized a CVE it had already seen in training.

So we built our own eval, consisting of real, published vulnerabilities in real repositories. Each repo is pinned to the commit where the bug still shipped, and every case is drawn from advisories published after the models’ training cutoffs. A hit therefore means Devin reasoned about the code, not that it recalled the advisory.

The Dataset

We built a dataset of 50 vulnerabilities across 14 languages including Go, Rust, Python, Ruby, Java, C#, JavaScript, C, Swift, Dart, and Elixir. The set covers various vulnerability classes like RCE, SQL injection, path traversal, SSRF, auth bypass, memory-safety bugs, and denial-of-service.

The dataset also spans repos of different sizes. We cover small projects like smallbitvec (60 KB, 10 files) up to large codebases like libcrux (92 MB, 1,754 files). We also selected for software-category diversity, so the eval can speak to attack-surface variety.

Category: Example Dataset Repos

Crypto: jose-swift, ruby-jwt/jwe, libcrux

Parsers and Codecs: yyjson, cowlib, wire, nokogiri

Web Servers and Frameworks: bandit, plug, puma, wsgidav

Template Engines: twig, liquidjs, scriban

Infrastructure: kopia, dex, filebrowser, opentelemetry-operator, anchor

Each vulnerability in the dataset has a published CVE with a GitHub Security Advisory. We start from the commit that fixed the bug and check out its parent: the last commit where the vulnerability was still live. This is the commit on which we run Security Swarm.
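As a rough sketch of that pinning step, the parent of a fix commit can be addressed with Git's `<rev>^` revision syntax. The helper names below are hypothetical, and for a merge commit `^` selects only the first parent, so such cases would need manual review.

```python
import subprocess


def parent_of(fix_commit: str) -> str:
    """Git revision syntax: '<sha>^' names the first parent of a commit."""
    return f"{fix_commit}^"


def checkout_command(repo_dir: str, fix_commit: str) -> list[str]:
    """Build the git invocation that pins a clone to the pre-fix commit."""
    return ["git", "-C", repo_dir, "checkout", "--detach", parent_of(fix_commit)]


def checkout_vulnerable(repo_dir: str, fix_commit: str) -> None:
    """Run the checkout; raises CalledProcessError if git fails."""
    subprocess.run(checkout_command(repo_dir, fix_commit), check=True)
```

Calling `checkout_vulnerable("path/to/clone", "<fix sha>")` leaves the working tree in a detached-HEAD state at the last commit where the vulnerability was still live.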

The specific vulnerabilities were selected for recency. Every advisory we use was published after the training cutoffs of the models we test, so the patch, the CVE, and the write-ups explaining the bug were never in their training data. A hit should reflect reasoning about the code, not recall of a known answer. We also reviewed harness trajectories to verify that agents did not look up CVEs in their own investigations. Note that the vulnerable code itself can predate the cutoff, since a flaw may sit in a file for years before it’s fixed, but we can be confident the answer was never handed to the model.
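The recency rule amounts to a date comparison against the latest cutoff among the models under test. A minimal sketch, with a hypothetical `Advisory` record and caller-supplied dates (none of the values are taken from the eval):

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Advisory:
    ghsa_id: str
    published: date


def eligible(advisories: list[Advisory], model_cutoffs: list[date]) -> list[Advisory]:
    """Keep advisories published strictly after every tested model's cutoff."""
    latest_cutoff = max(model_cutoffs)
    return [a for a in advisories if a.published > latest_cutoff]
```

Comparing against the maximum cutoff is the conservative choice: a case stays in the set only if no tested model could have seen the advisory. The same check can be rerun when a newer model is added, which flags cases to retire.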

The commits themselves were also vetted. What an advisory labels as the “unpatched” commit is occasionally already fixed, and the flaw sometimes lives in a vendor dependency rather than in the project’s own source. So for every case we confirm the vulnerable code is actually present before it earns a spot. We also favor small-to-medium projects over sprawling monorepos, where a single labeled bug may drown in lookalikes.

aws/amazon-redshift-python-driver

Database driver

CVSS 9.8

Language: Python

Size: 2.0 MB · 179 files

Class: RCE

CWE: CWE-94

GHSA ID: GHSA-29h4-r29x-hchv

Commit ID: 2c1dd5b9aca1945a1b8e01b2359075d9e8b0e77c

Vulnerability

A column-type parser runs eval() on data the server sends back, so a malicious server executes code on the client.
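To make this bug class concrete, here is a generic illustration of the pattern and one common mitigation. It is not the driver's actual code: both function names are hypothetical, and the choice of `ast.literal_eval` as the fix is an assumption.

```python
import ast


def parse_vector_unsafe(text: str):
    # Anti-pattern: eval() executes any Python expression the peer sends.
    return eval(text)


def parse_vector_safe(text: str) -> list[float]:
    # literal_eval accepts only Python literals and raises ValueError on
    # calls, attribute access, or names.
    value = ast.literal_eval(text)
    is_number_list = isinstance(value, list) and all(
        isinstance(x, (int, float)) and not isinstance(x, bool) for x in value
    )
    if not is_number_list:
        raise ValueError("expected a list of numbers")
    return [float(x) for x in value]
```

With the unsafe parser, a payload such as `__import__('os').system(...)` would run on the client. The safe parser rejects that payload with `ValueError` before anything executes.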

Grading

Every case hands Security Swarm a haystack with one needle in it: a real codebase, and one known vulnerability hidden somewhere inside. Grading asks a single question: did the scan find that needle?

In essence, we grade for recall. Recall is the fraction of the 50 cases in which at least one of the run’s findings describes the target vulnerability; everything else, including false positives, is ignored. This follows from the setup: we know exactly where the one needle is, so checking whether it was found is cheap. The rest of the haystack may be full of other plausible findings we have no ground truth for, and labeling each one true or false at scale isn’t feasible. So we score what we can actually verify.
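In code, the metric reduces to the sketch below. The function names are hypothetical, and each inner list holds one boolean per finding in a run, true when the judge says the finding matches the target.

```python
def case_hit(finding_matches: list[bool]) -> bool:
    """A case counts as a hit if at least one finding matches the target."""
    return any(finding_matches)


def recall(cases: list[list[bool]]) -> float:
    """Fraction of cases with a hit; non-matching findings never affect the score."""
    if not cases:
        raise ValueError("no cases to grade")
    return sum(case_hit(matches) for matches in cases) / len(cases)
```

By this definition, a recall of 72% on 50 cases corresponds to 36 hits, and a run that reports ten extra unrelated findings scores the same as a run that reports none.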

When a run produces findings, we use an agent to judge whether any of them match the target, based on semantic match. The answer key for each case is a one-line description of the bug:

vulnerability: "RCE via eval() on server-supplied vector data"
cwe: "CWE-94"

A finding matches if it lands on the same root cause in the same place, with the CWE and file path as hints. We don’t require matching wording or line numbers, since two researchers can write up the same bug differently, and so does an agent.

Note that we require the specific vulnerability, not just any real bug in the right file. It’s a strict rule, chosen so results stay comparable across runs and over time.
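A sketch of how a case might be graded under this rule follows. The `AnswerKey` and `Finding` shapes and the pluggable `judge` callable are assumptions, and in the setup described above the judge is an agent doing a semantic comparison, not a function like the toy one in the usage note.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class AnswerKey:
    vulnerability: str  # one-line description of the target bug
    cwe: str


@dataclass(frozen=True)
class Finding:
    title: str
    path: str
    cwe: str


# Hypothetical judge signature: True when a finding describes the same root
# cause in the same place as the answer key.
Judge = Callable[[AnswerKey, Finding], bool]


def grade_case(key: AnswerKey, findings: list[Finding], judge: Judge) -> bool:
    """Strict rule: the case is a hit only if some finding matches the target."""
    return any(judge(key, finding) for finding in findings)
```

For a quick demonstration, a toy judge such as `lambda key, f: "eval(" in f.title` can stand in for the agent. A finding with the same CWE in the same file but a different root cause then still grades as a miss, which is the behavior the strict rule is meant to enforce.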

Right Area, Different Defect

The strict bar surfaces a pattern we kept hitting: a run opens the exact file the bug lives in and finds a genuine vulnerability, but reports a different bug.

One example case was facil.io. The target defect is an infinite loop in its JSON parser triggered by a bare i/I (Infinity) token. Runs discovered the right file and flagged real defects in it: a depth-counter underflow and an over-read in number parsing, just not the bare-token loop we were grading for. Both were genuine flaws; neither was the one on the answer key.

Security Swarm’s actual job is to find unknown bugs. On cases like this it did find real, unreported vulnerabilities; it just didn’t find the specific one we had labeled. Counting that as a miss is correct for our benchmark, but it means recall understates detection: the needle we grade is one of several in the haystack, and surfacing a different real needle still scores zero. So one can read the recall numbers as a floor on what a run finds, not a ceiling.

Results

We ran Devin Security Swarm against the eval alongside the other leading AI security analyzers. On a dataset where a hit requires reasoning about code the model has never seen, Security Swarm finds the target vulnerability in 72% of cases, the highest recall of any tool we tested, and does so at roughly two-thirds the cost of the next-best performer. It leads on both detection and economics at once, rather than trading one for the other.

For each case in the dataset, we ran every security tool, including Security Swarm, against the same repository, checked out at the same pre-patch commit. We did not add custom prompts, configuration, or benchmark-specific tuning. We then counted how many of the 50 target vulnerabilities appeared in each tool’s final findings. We report here the average cost per scan across the 50 runs.

Cost and recall usually pull against each other: you can buy more findings by spending more compute. Security Swarm doesn’t sit on that tradeoff curve. It returns the most needles and costs less than the alternative closest to it, which is the result the Agentic MapReduce architecture was built to produce.

What’s Next

As the CVEs in this eval eventually fall inside model training cutoffs, we will retire and replace them. We see the dataset as a living benchmark that we will continue to expand across languages, vulnerability classes, and software categories. We are especially interested in classes that are underrepresented in public benchmarks but common in real incidents, including deserialization, race conditions, authorization logic, and bugs that do not map cleanly onto a single CWE.

Keeping the dataset current matters for the same reason we built it this way in the first place: once a model could have seen the answer during training, a case no longer tells us whether the model actually reasoned about the code. We will keep evolving and expanding the dataset so the eval continues to measure what Security Swarm is built to do: find vulnerabilities that are not already known.