2026-06-18 · Rewritten on-site · 7 min read · Updated: 2026-06-18

We Got Anthropic's Glasswing at Home (Who Needs Mythos 5 or Fable 5?)

Inspired by Anthropic's Glasswing, the author built Lucent, a staged source-code bug-hunter that runs on a local 27B Qwen model on a single RTX 3090. First run against hermes-agent: 1,342 static hits narrowed to 126 leads, then to 15, and finally 2 real bugs. Local read cost ~$1.62. The best moment: a reviewer agent caught that three earlier exploits were scored against an outdated threat model. Detailed pipeline, hardware, and lessons included.

Source: Hacker News AI · Author: Seventeen18


Summary

Anthropic has Glasswing, an autonomous security researcher. I wanted one on hardware I own, so I built Lucent: a staged source-code bug-hunter whose high-volume reading runs on a local 27B Qwen on a single RTX 3090, served by Lucebox at roughly 3.4× decode speed. I pointed it at hermes-agent. A static pass threw up 1,342 hits; the local sweep cut that to 126 candidate findings; a frontier-model adversarial audit triaged 15 leads down to the 2 that are real and in scope. The local reading billed about $1.62. The best moment of the engagement was a reviewer agent catching that I had been scoring three earlier exploits against a threat model the vendor had quietly rewritten.

Anthropic has Glasswing. We have Glasswing at home.

Anthropic has Glasswing: an autonomous security researcher that reads a codebase on its own and comes back with real vulnerabilities. I wanted one too. Not rented by the API call, but running on a machine in the room with me, on models I control and can leave grinding overnight for the price of electricity.

This is how I built that machine, what its own telemetry says it did, and the two real bugs the finished version found the first time I pointed it at a serious target.

The honest headline before the build story: against hermes-agent, a static pass flagged 1,342 candidate sites, the local sweep narrowed those to 126 candidate findings, and an adversarial audit cut them to 15 leads and then to the two bugs worth disclosing. One is an approval prompt that anyone in the chat can answer. The other is also in scope and reported to Nous, but its fix is still landing, so I am holding its details until it ships. The sharpest result of the engagement came from a reviewer agent. Partway through, after I had already "demonstrated" three other exploits, it caught that I had been grading them against a security policy the vendor had replaced six weeks earlier, which deleted most of my wins.

The first version was bad

The first attempts were barely automated. I drove a big cloud model, Opus 4.7, by hand against a real target and asked it to find bugs. It produced a confident pile: five findings and a top-ten list, almost all of it false positives and dead ends. The most convincing one was a path traversal in a PDF-extraction routine that could drop a .pth file and turn into code execution at the next Python startup, a clean chain on paper. It fell apart the moment I checked the one thing I should have checked first, whether the upstream library normalizes the filename before it writes. It does. The traversal never reaches disk.
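The check that killed that chain can be sketched in a few lines. This is a minimal illustration, assuming the upstream normalization amounts to keeping only the final path component; the post does not show what the real library does, so `normalize`, `BASE`, and the hostile filename are all hypothetical.

```python
import os

BASE = "/srv/extract/images"  # hypothetical output directory


def escapes(base_dir: str, filename: str) -> bool:
    """Return True if joining filename onto base_dir lands outside base_dir."""
    base = os.path.realpath(base_dir)
    target = os.path.realpath(os.path.join(base_dir, filename))
    return os.path.commonpath([base, target]) != base


def normalize(filename: str) -> str:
    """Stand-in for upstream normalization (assumption): keep the last component only."""
    return os.path.basename(filename.replace("\\", "/"))


hostile = "../../site-packages/evil.pth"

raw_escapes = escapes(BASE, hostile)                     # traversal works on the raw name
normalized_escapes = escapes(BASE, normalize(hostile))   # and is neutralized after normalization
```

If the second call reports no escape, the chain is dead no matter how clean the rest of it looks on paper, which is why this is the first thing to check rather than the last.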

This is the normal failure mode for a model run without checks: high confidence, mostly wrong. A bigger model does not fix it. The bottleneck is discarding bad leads fast enough to keep up, so I stopped working in a chat window and built a pipeline.

Building Lucent

There was an open-source starting point to borrow from, and I did, briefly. It did not do what I needed, and by the time the thing was finding real issues I had rewritten most of it. I call it Lucent.

Lucent is not a conversation. It is a staged pipeline, each stage free to run a different model, with the target's source read-only-mounted inside a locked-down Docker sandbox:

Rank. Score every source file for how likely it is to hide a vulnerability, so the expensive stages spend their budget where it matters. On a large tree this is the difference between a run that finishes and one that does not.

Hunt. A tiered pool of file-parallel agents reads the ranked files and records leads, each pinned to a file:line and a described mechanism. This is the highest-volume stage, and it runs on the local model.

Verify. An adversarial pass re-reads each lead against the source and tries to disprove it: wrong mechanism, not reachable, an artifact of the harness. Nothing advances until it survives this.

Exploit. Survivors go to exploit triage and a variant loop that tries to produce a working proof of concept.
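The first three stages can be sketched as a plain function pipeline. This is a rough outline under stated assumptions, not Lucent's code: the `Lead` type, the stage callables, and the toy file names are all hypothetical, and the exploit stage is left out because it needs a live target.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Lead:
    file: str
    line: int
    mechanism: str


def run_pipeline(
    files: Iterable[str],
    score: Callable[[str], float],          # rank: higher means more suspicious
    hunt: Callable[[str], List[Lead]],      # hunt: read one file, record leads
    disprove: Callable[[Lead], bool],       # verify: True if the lead is killed
    top_n: int,
) -> List[Lead]:
    # Rank: spend the expensive stages on the most suspicious files only.
    ranked = sorted(files, key=score, reverse=True)[:top_n]
    # Hunt: every lead is pinned to a file, a line, and a described mechanism.
    leads = [lead for path in ranked for lead in hunt(path)]
    # Verify: nothing advances unless the adversarial pass fails to disprove it.
    return [lead for lead in leads if not disprove(lead)]


# Toy stand-ins for the model-backed stages (hypothetical).
def toy_score(path: str) -> float:
    return 0.9 if "approval" in path else 0.1


def toy_hunt(path: str) -> List[Lead]:
    return [Lead(path, 1, f"unchecked input in {path}")]


def toy_disprove(lead: Lead) -> bool:
    return lead.file.startswith("docs/")


files = ["docs/readme.py", "gateway/approval.py", "util/strings.py"]
survivors = run_pipeline(files, toy_score, toy_hunt, toy_disprove, top_n=2)
```

Because each stage is just a callable, swapping the model behind one stage does not touch the others, which is the property the staged design depends on.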

Nothing is called a finding until it reaches the top of an evidence ladder: suspicion → static_corroboration → crash_reproduced → root_cause_explained → exploit_demonstrated. The top rung means a script that runs and shows the behavior against a live instance. Everything below that is a lead, and leads are cheap. The verify stage is the one that did the most work, and most of what follows is about why.
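The ladder maps naturally onto an ordered enum. The rung names below come from the text; the `is_finding` and `promote` helpers are hypothetical, a minimal sketch of the rule that only the top rung counts.

```python
from enum import IntEnum


class Evidence(IntEnum):
    SUSPICION = 0
    STATIC_CORROBORATION = 1
    CRASH_REPRODUCED = 2
    ROOT_CAUSE_EXPLAINED = 3
    EXPLOIT_DEMONSTRATED = 4


def is_finding(level: Evidence) -> bool:
    """Only the top rung counts as a finding; everything below is a lead."""
    return level == Evidence.EXPLOIT_DEMONSTRATED


def promote(level: Evidence) -> Evidence:
    """Move up exactly one rung; skipping rungs is not allowed."""
    if level == Evidence.EXPLOIT_DEMONSTRATED:
        raise ValueError("already at the top of the ladder")
    return Evidence(level + 1)


level = Evidence.SUSPICION
climbed = [level]
while not is_finding(level):
    level = promote(level)
    climbed.append(level)
```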

The rig: one GPU, a local 27B, Lucebox

The high-volume reading runs on a single RTX 3090. The ranker and the hunters drive a local open Qwen3.6-27B served by Lucebox, which uses speculative decoding: it drafts several tokens ahead and verifies them in a batch, so it commits multiple tokens per step instead of one. On this card, against the 27B at 4-bit, that works out to roughly 3.4× faster generation on code-like text (about 130 tokens per second, against 38 for plain autoregressive decoding), peaking past 4× on the most regular files. That is the difference between a 27B that reads source fast enough to point at a large tree and one that does not. For reference, the speculative-decoding paper this borrows from reports 4–5× on smaller models on a datacenter card; 3.4× on a 27B at 4-bit on a consumer 3090 is the version you can own.
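The draft-then-verify idea can be shown with a toy. This is the simplest greedy variant, written as a sketch and not as Lucebox's implementation, which the post does not describe. The `target` and `draft` functions are hypothetical stand-ins for the big and small models, and the verification loop stands in for what a real system does in one batched forward pass.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def greedy_decode(model: NextToken, prompt: List[int], max_new: int) -> List[int]:
    """Plain autoregressive decoding: one model step per token."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(model(out))
    return out[len(prompt):]


def speculative_decode(
    target: NextToken, draft: NextToken, prompt: List[int], max_new: int, k: int = 4
) -> Tuple[List[int], int]:
    """Greedy speculative decoding. Returns (new tokens, number of verify steps)."""
    out = list(prompt)
    steps = 0
    while len(out) - len(prompt) < max_new:
        # Draft k tokens ahead with the cheap model.
        ctx = list(out)
        proposal = []
        for _ in range(k):
            tok = draft(ctx)
            proposal.append(tok)
            ctx.append(tok)
        # Verify the whole proposal; counted as a single target step.
        steps += 1
        ctx = list(out)
        for tok in proposal:
            expected = target(ctx)
            ctx.append(expected)
            if expected != tok:
                break  # keep the target's correction, drop the rest of the draft
        else:
            ctx.append(target(ctx))  # all k accepted: one bonus token from the target
        out = ctx
    return out[len(prompt):len(prompt) + max_new], steps


def target(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10


def draft(ctx: List[int]) -> int:
    # Agrees with the target except after a 7, where it guesses wrong.
    return 0 if ctx[-1] == 7 else (ctx[-1] + 1) % 10


plain = greedy_decode(target, [0], 20)
spec_tokens, steps = speculative_decode(target, draft, [0], 20)
```

The output is identical to plain greedy decoding, but it takes fewer target steps. That is the source of the speedup: 130 tokens per second against 38 is a ratio of about 3.4.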

1 RTX 3090: the whole hunting rig

27B: local Qwen, the hunter

~3.4×: Lucebox decode speedup on code

130 tok/s: vs 38 autoregressive

$1.62: metered cloud, the local hunt

15 → 2: leads to real bugs

The reading runs on one consumer GPU. A frontier model does the adversarial judgment.

Local does not mean only reading. The ranking, the hunting, and the verify that kills most of the leads all run on that 27B; the model that throws out all but 126 of the 1,342 static hits is the local one. A frontier model, Opus 4.7, sits on top: it orchestrates the run and does the final deep audit of the few survivors, building the proofs and arguing each one down. Each stage is free to run a different model, so that top layer is a choice rather than a floor. It earns its place for one narrow reason: a reviewer that tries to break a finding has to be at least as sharp as whatever built it, or it waves everything through. Everything beneath it stays local, because reading and triaging an 873k-line tree is the work that scales, and metering that by the token is what makes this kind of research expensive.

Ripping out Monty, wiring in Hermes

The pipeline came with an agent loop called Monty, the driver that decides what to read next and how to reason about it. It was rigid where I needed it to improvise, and it fumbled reasoning that spanned several files, so I tore it out and dropped in NousResearch's Hermes Agent, then spent a while tuning it to drive the hunt the way I wanted.

This is worth pausing on. The agent I picked to be the brain of the hunter is the same NousResearch project, hermes-agent, that I would later turn the finished hunter loose on. That was not planned. I chose Hermes because it was the best agent I had on hand, got to know its internals while making it mine, and only afterward pointed the result at the project itself. Knowing a codebase that well, from the inside, is part of why the hunt went where it did.

Weeks of it not working

None of this worked for a long time. The honest version of the timeline is weeks of the pipeline doing something wrong: the ranker burying the interesting files, the hunt stage inventing file:line references that did not exist, the verifier waving through things it should have killed, the whole run wedging because the Lucebox gateway is single-flight and would stall under load and need a restart. A cold sweep of a large tree takes hours, so every bad assumption cost the better part of a day to find.
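Invented file:line references are the cheapest of these failures to catch mechanically. The sketch below is one possible guard, not necessarily what Lucent does: `reference_exists` is a hypothetical helper that rejects a lead whose file is missing, sits outside the source root, or is shorter than the cited line.

```python
import tempfile
from pathlib import Path


def reference_exists(root: Path, ref: str) -> bool:
    """Check that a 'path/to/file.py:LINE' reference points at a real line under root."""
    file_part, _, line_part = ref.rpartition(":")
    if not file_part or not line_part.isdigit():
        return False
    root = root.resolve()
    path = (root / file_part).resolve()
    # Reject missing files and anything that resolves outside the source tree.
    if not path.is_file() or root not in path.parents:
        return False
    with path.open(encoding="utf-8", errors="replace") as fh:
        line_count = sum(1 for _ in fh)
    return 1 <= int(line_part) <= line_count


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp)
    (src / "handler.py").write_text("a = 1\nb = 2\nc = 3\n", encoding="utf-8")
    real_ref = reference_exists(src, "handler.py:2")
    past_end = reference_exists(src, "handler.py:9")
    missing_file = reference_exists(src, "ghost.py:1")
    malformed = reference_exists(src, "handler.py")
```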

I fixed them one at a time. I taught the verifier to distrust its own inputs, including files on disk, which mattered more than I expected (more on that below). I made Lucent checkpoint mid-run, so a stall resumes instead of starting over. I added a triage layer that recognizes when a framework or protocol library neutralizes a bug before the dangerous code can run, so those stop being reported as findings at all. Somewhere in there it went from producing confident garbage to producing a short list worth reading.
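Mid-run checkpointing can be as simple as recording which files are done after each one. The sketch below assumes a JSON checkpoint file and a per-file `process` callable; the names are hypothetical and the post does not say how Lucent stores its state.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Callable, List, Set


def load_done(checkpoint: Path) -> Set[str]:
    if checkpoint.exists():
        return set(json.loads(checkpoint.read_text(encoding="utf-8")))
    return set()


def save_done(checkpoint: Path, done: Set[str]) -> None:
    # Write to a temp file, then rename, so a crash never leaves a half-written checkpoint.
    tmp = checkpoint.with_suffix(".tmp")
    tmp.write_text(json.dumps(sorted(done)), encoding="utf-8")
    os.replace(tmp, checkpoint)


def sweep(files: List[str], checkpoint: Path, process: Callable[[str], None]) -> Set[str]:
    done = load_done(checkpoint)
    for path in files:
        if path in done:
            continue  # finished in an earlier run
        process(path)
        done.add(path)
        save_done(checkpoint, done)
    return done


files = ["a.py", "b.py", "c.py"]
seen: List[str] = []
stalled: Set[str] = set()


def flaky(path: str) -> None:
    # Simulate the gateway stalling once on b.py.
    if path == "b.py" and path not in stalled:
        stalled.add(path)
        raise TimeoutError("gateway stalled")
    seen.append(path)


with tempfile.TemporaryDirectory() as tmp:
    ckpt = Path(tmp) / "sweep.json"
    try:
        sweep(files, ckpt, flaky)
    except TimeoutError:
        pass  # restart the gateway here, then resume
    finished = sweep(files, ckpt, flaky)
```

On a sweep where a cold run takes hours, the point is that the second call skips `a.py` and does not reread it.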

First real target: hermes-agent

hermes-agent is NousResearch's open-source personal agent: a daemon that ingests messages from many channels (Telegram, email, Slack, Matrix, Feishu, and others) and is allowed to run shell commands, write files, execute code, and install "skills" in response to them. The attack surface is the gap between a large set of untrusted-input paths and a large set of privileged actions. The tree is about 873k lines across 2,903 source files, the kind of messy, real codebase I built Lucent to chew on. I read the source and never modified it, and reported what I found to Nous Research.

Static analysis alone flagged 1,342 candidate sites before any model ran: 72 it called critical, 105 high, 1,162 medium, 3 low, and zero verified, because a pattern matcher cannot verify anything. That number is the input to the pipeline, not an output. The ranker pushed the most suspicious files to the top, the hunters read down from there, and the funnel narrowed:

The full triage funnel: 1,342 static-analysis hits shed down through 126 candidate findings and 15 leads to the 2 real bugs (finding 13 here; the second held until its fix ships).
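The numbers in the funnel are easy to sanity-check. The figures below are the ones reported in this post; the per-stage survival rates are derived from them and appear nowhere in the original telemetry.

```python
# Static-analysis severity breakdown as reported above.
static_by_severity = {"critical": 72, "high": 105, "medium": 1162, "low": 3}
static_total = sum(static_by_severity.values())

# Funnel stages as reported above.
funnel = [
    ("static hits", 1342),
    ("candidate findings", 126),
    ("leads", 15),
    ("real bugs", 2),
]

# Fraction of items surviving each stage, relative to the previous stage.
survival = {
    name: count / prev_count
    for (_, prev_count), (name, count) in zip(funnel, funnel[1:])
}
overall = funnel[-1][1] / funnel[0][1]

for name, rate in survival.items():
    print(f"{name}: {rate:.1%} of the previous stage")
print(f"overall: {overall:.2%} of static hits were real bugs")
```

Each stage keeps roughly a tenth of what it is handed (9.4%, 11.9%, 13.3%), so most of the pipeline's work is discarding.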

How the hunt ran

The narrowing from 126 to 15 is the adversarial audit, and it is worth showing its cost because it is the opposite of the local sweep: bounded, frontier-model, and where the real spend is. About 20 specialized agents ran across the engagement: six recon auditors in parallel, each owning one surface (auth, data flow, output); one architect that turned leads into a build plan; six builders that each wrote one runnable proof of concept in its own git worktree; six reviewer passes whose only job was to tear findings apart; and one long-running orchestrator on top. Together they used roughly 1.5 million tokens across the 13 agents I have clean telemetry for, and past 2 million counting recon and orchestration. Individual agents made 20 to 96 tool calls; the builder for the API-server bug alone made 87 before it was satisfied.

It was not a clean afternoon. Two builders died to a harness error mid-run and were recovered from disk and relaunched. The account hit a usage limit partway through, armed an auto-resume, and picked up when the limit reset. The total was about 1h45m of measured agent-compute, but that is a sum, not a wall clock, because the builders and reviewers ran in parallel. Leaving that mess out is how a writeup ends up reading like a product demo.

Most leads do not survive

The first sweep produced 126 candidate findings, and almost all of them were wrong. The convincing ones were the problem, the leads that looked solid enough to waste a day on.

The findings run 01–15 in a single sequence across this writeup, so the rows below are not contiguous: this table collects the dead ends, the real-but-out-of-scope behaviors come after it, and the complete list is in the scorecard at the end.

| # | What it looked like | What it actually was |
|---|---|---|
| 01 | env-var exposure in web research | False positive: a grep match, not a real flow |
| 02 | marker-pdf extract → .pth RCE chain | Retracted: the library normalizes the filename upstream |
| 03 | WhatsApp-bridge path traversal | Retracted: Baileys strips it before the bridge sees it |
| 04 | dashboard markdown XSS | Not exploitable: React and Chromium eat it |
| 05 | a cluster of high-sev hits | FP / latent (one nugget foreshadowed the second in-scope finding) |
| 07 | vision tool local-file read | Retired: MIME-gated, superseded |

Two of those retractions are worth the detail, because they are exactly the bugs you ship if you skip the last check. Finding 02 is the .pth chain from the first version, found again by the pipeline and killed the same way: marker-pdf normalizes the image filename before the vulnerable os.path.join ever sees it,

[truncated for AI cost control]