AI News HubLIVE
In-site rewrite3 min read

When is an AI agent's approval prompt a security boundary?

A security researcher reported three approval-bypass vulnerabilities to the open-source Hermes Agent project. After submission, the project rewrote its security policy to redefine the approval prompt as a non-boundary heuristic, closing the findings as out of scope. The article examines industry inconsistency, contrasting Hermes's response with Anthropic's treatment of a similar bypass as a high-severity vulnerability, and raises the question of whether the human confirmation step is a security control or a convenience in default deployments.

SourceHacker News AIAuthor: nrig

Instantly share code, notes, and snippets.

NikosRig/ai-agent-approval-prompt-as-a-security-boundary.md

Created June 23, 2026 21:00

Show Gist options

Download ZIP

Star

0 (0)

You must be signed in to star a gist

Fork

0 (0)

You must be signed in to fork a gist

No results found

Clone this repository at

Save NikosRig/b4330ceb780fe22bf3c14f38d7d90795 to your computer and use it in GitHub Desktop.

No results found

Clone this repository at

Save NikosRig/b4330ceb780fe22bf3c14f38d7d90795 to your computer and use it in GitHub Desktop.

Download ZIP

When is an AI agent's approval prompt a security boundary? A disclosure timeline + an industry inconsistency.

ai-agent-approval-prompt-as-a-security-boundary.md

When is an AI agent's approval prompt a security boundary?

I reported three approval-bypass findings to an open-source AI agent. Between the day I submitted and the day they replied, the project rewrote its security policy — in a way that reclassified my findings out of scope — and then closed them citing the new text. This is a writeup of what happened and the genuine question underneath it, because I don't think the answer is obvious and I think the industry hasn't settled it.

I'll start by conceding the other side, because it's strong.

The vendor is not wrong about the hard part

The project is Hermes Agent (Nous Research). Like most agents with shell access, it screens commands against a denylist and prompts the operator before running anything that looks destructive. Their current position is that this gate is an in-process heuristic, not a security boundary — that shell is Turing-complete, a denylist over shell strings is structurally incomplete, and the real boundary for adversarial input is OS-level isolation (run it in a container).

That is correct. You cannot regex your way to a complete boundary over shell, and "run untrusted workloads in a sandbox" is the right posture. I'm not disputing any of that, and any framing of this story that ignores it is unfair.

The three findings (mechanism only — two are still live)

Smart-approval prompt injection. In the optional "smart" mode, a second LLM judges flagged commands. The untrusted command was interpolated into the reviewer's prompt with no separation between data and instructions, and the verdict was parsed with a loose substring match. Injected text could talk the reviewer into approving.

Startup-hook code execution. Any .py file in the agent's hooks directory is executed at gateway startup — no registration, no hash, no signature. A prompt-injected model can write that file via a normal tool call that triggers no approval, yielding code execution on the next restart.

Approval-gate parsing bypass. The detector matches regex against the raw command string, not parsed shell tokens. Equivalent rewrites — quoted command names, variable indirection, alternate shell binaries, octal chmod prefixes, versioned interpreter names — run the same dangerous action and bypass the prompt entirely.

I retested all three against the current release in a clean Docker build before writing this. Finding #1 was meaningfully hardened in June (the live bypass rate dropped from 6/8 to 1/8). Findings #2 and #3 still reproduce on the current version. I'm deliberately not publishing weaponized exploits for the two that are still live.

These matter for one specific, common deployment: the default local backend, exposed to untrusted input (a messaging gateway, web content, MCP output), without a sandbox. In that configuration the prompt is the thing the operator is counting on, and it can be skipped.

The part I think is worth discussing

Two things, both verifiable.

The timeline. The version of SECURITY.md live the day I reported called the approval system "a core security boundary" and explicitly placed in scope "prompt injection ... that results in a concrete bypass of the approval system." Six days later the policy was rewritten ("rewrite policy around OS-level isolation as the boundary"); the approval gate became a non-boundary heuristic and the clause that put my findings in scope was removed. My reports were then closed as out of scope, citing the new sections — without acknowledging that the policy had changed since submission. The commits are public: 401aadb5b, 0d1cbc2dd.

I don't claim malice. The original policy was two weeks old and may have over-claimed; the rewrite reads like a genuine clarification. But the procedure — change the scope while reports are open, close under the new text, don't flag the change — is the part that sits wrong with me, independent of whether the new threat model is right.

The industry inconsistency. Finding #3 is the same class as CVE-2026-24887 in Claude Code: "an error in command parsing" that lets untrusted input "bypass the confirmation prompt." Anthropic rated it 8.8 HIGH and shipped a fix in 2.0.72. Anthropic also recommends sandboxing Claude Code — the same posture Hermes invokes — and still treated a confirmation-prompt bypass as a real, high-severity vulnerability. "The sandbox is the real boundary" and "a prompt bypass is a vulnerability" are evidently not mutually exclusive; a direct peer holds both. And Hermes themselves shipped a fix(security) commit for finding #1 — the very class they'd closed as out-of-scope.

The actual question

Two serious projects looked at the same class of bug and reached opposite conclusions about whether it's a vulnerability at all — and the line between them is drawn in policy, not in code. As we hand agents real shell access, "is the human-confirmation step a security control or a convenience?" stops being philosophical: it decides whether bypasses get fixed, get CVEs, or get closed. I think it's a control whose bypass matters in the default deployment. Reasonable people disagree. I'd like to hear how others draw the line.

Everything above is verifiable from public git history and public vulnerability databases. I'm happy to answer questions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment