
Terminal Apps Need a DOM

agent-tui is an open source tool that gives terminal apps a structured, queryable interface similar to the browser DOM, enabling AI agents to interact with terminal screens by referencing stable elements and waiting for state changes.

Source: Hacker News AI | Author: philips


Paul Querna

2026-06-30 | 9 min read


When we were building Squire, C1's software factory, we hit a slightly absurd problem: the AI tools were also built for humans.

Squire could give an agent work. But Claude Code, Codex, Pi, and similar AI harnesses present themselves as terminal apps first. Their live interface is a TUI made for a person: a prompt, a streaming response, approval screens, file-change panes, and a cursor waiting for the next instruction. Another agent can type into that interface. It still needs to know whether the response is done, whether an approval screen appeared, or whether the cursor has returned to the prompt.

That is the problem agent-tui solves. It runs the target program on a PTY, keeps the terminal state alive in a daemon, exposes the rendered screen as text or an outline with stable refs, and lets a client snapshot, press keys, and wait for named screen state. It gives terminal apps the same kind of queryable surface that made browser automation useful.

agent-tui is open source. We are publishing it under the Apache-2.0 license at github.com/ConductorOne/agent-tui. The design comes from our experience with agent-browser: give the agent something it can query instead of a pile of pixels. agent-tui applies that idea to terminal apps.

One common Squire pattern is an orchestration agent: a coding harness receives the assignment, then drives another harness to do the work. In this demo, OpenAI's Codex uses agent-tui to drive the Pi harness through a real terminal session.

Agents driving agents

agent-tui starts the outer Codex TUI, waits for @codex.input, types the task, and presses enter. Codex then runs the command sequence below, which starts a second agent-tui daemon around Pi.

The Pi side is just another agent-tui session:

    agent-tui daemon run
    agent-tui spawn -- pi --offline --no-extensions --no-context-files --no-skills
    agent-tui wait --ref '@pi.input'
    agent-tui type --to '@pi.input' 'Reply with the token formed by joining INNER, AGENT, and OK with _.'
    agent-tui press --to '@pi.input' ''
    agent-tui snapshot --mode text

The result is two live screens: one where Codex receives the task, and one where Pi answers inside the nested session.

Vercel's AI SDK harnesses frame agent CLIs as provider-specific surfaces, not one generic wrapper. agent-tui takes the same approach to terminal screens: keep each app's shape, then expose the parts an agent can query.

Why terminal apps need structure

Most useful terminal programs were built for humans, not machines.

htop, vim, lazygit, psql, language REPLs, and newer agent CLIs such as Claude Code and Codex have different jobs. They share one automation problem: the live interface is a terminal session. It owns a PTY, redraws a grid of cells, and expects a person to infer what changed. Some tools expose batch modes. Many do not. Even when a batch mode exists, it is often a different interface from the human session you need to observe, interrupt, or steer.

An agent can write bytes to a terminal easily. The hard part is knowing what happened after those bytes landed. A full-screen program may repaint in place, move the cursor, enter an alternate screen, update one field, and never print a clean line that says "ready."

The usual choices make bad contracts with the terminal. Escape-sequence parsing treats the byte stream as the API. Rendered-text scraping throws away state. Sleeping between keystrokes punts the problem to the scheduler, which means the script works until CI is slow or one prompt lands in a state you did not match.

Give the terminal a DOM

Vim is a good stress test: it is a full-screen editor, not a command that prints lines. Here agent-tui drives a real Vim session through refs instead of sleeps.

In the recording, the left pane issues agent-tui commands and the right pane is the Vim PTY. The driver waits for the buffer, reads the mode, enters insert mode, writes hello world, saves hello-world.txt, and checks the file contents.

Raw terminal text is a bad handle for this job. agent-tui exposes an outline: a tree of screen regions with roles and stable refs. A ref is a name for something on the screen. It lets an agent say "the Vim mode indicator" instead of "row 24, column 1."

    agent-tui spawn -- vim notes.md
    agent-tui wait --ref '@vim.buffer'   # vim has rendered
    agent-tui --json snapshot --select '@vim.mode' | jq -c '.data.outline.nodes[0]'

    {"durable":true,"ref":"@vim.mode","role":"mode","value":"normal"}

@vim.mode is durable. It names the same part of the screen whether the value is normal, insert, or something else. Refs can be queried with a small selector language: [role=buffer][focused], @vim.mode[value=insert], @tmux.pane[%2].

The built-in vim and shell adapters emit named refs such as @vim.mode and @shell.prompt. A TOML manifest can teach agent-tui the regions of another app without adding Rust code. With no adapter at all, the generic adapter still groups the screen into coarse regions and gives them refs:

    agent-tui --json snapshot --mode outline | jq -c '[.data.outline.nodes[] | {ref, role}]'

    [{"ref":"@e1","role":"meters"},{"ref":"@e2","role":"table"},{"ref":"@e3","role":"footer"}]

We patterned this after browser automation, where agent-browser has worked well for our own agents. A browser agent should not click pixel 412, 308; it should click the button. A terminal agent should not depend on a fixed row when agent-tui can name the mode, prompt, focused pane, or table.

Wait for screen state

A snapshot only helps if you know when to take it.

sleep 0.2 is a guess about scheduling, terminal redraw, and the program under test. It will be too long on a fast machine and too short on a loaded runner. Worse, it is not connected to the state you care about.

The wait subcommand is tied to screen state. It can wait for a ref to appear, for a ref to disappear, for a selector value, for a regex, for an event sequence, or for the child process to exit. For screens with no named ref yet, a client can take a snapshot, keep its screen hash, send input, and wait until the rendered grid changes.
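The snapshot-and-compare fallback described above can also be approximated from the client side. The following is a minimal sketch, assuming only that `agent-tui snapshot --mode text` prints the rendered screen to stdout, as in the examples in this article. The names `AGENT_TUI`, `POLL_INTERVAL`, `screen_hash`, and `wait_for_change` are hypothetical, and `cksum` stands in for whatever screen hash the daemon itself reports.

```shell
#!/bin/sh
# Sketch only: these helper names are illustrative, not part of agent-tui.
AGENT_TUI="${AGENT_TUI:-agent-tui}"

# Hash the rendered screen text. cksum is a stand-in for the daemon's own hash.
screen_hash() {
  "$AGENT_TUI" snapshot --mode text | cksum
}

# wait_for_change BEFORE_HASH [MAX_POLLS]
# Poll until the screen hash differs from BEFORE_HASH; fail after MAX_POLLS tries.
wait_for_change() {
  before=$1
  polls=${2:-50}
  while [ "$polls" -gt 0 ]; do
    [ "$(screen_hash)" != "$before" ] && return 0
    polls=$((polls - 1))
    sleep "${POLL_INTERVAL:-0.1}"   # pause between checks, not a guess about readiness
  done
  return 1
}

# Intended use (not executed here):
#   before=$(screen_hash)
#   agent-tui press j
#   wait_for_change "$before" || echo "screen never changed" >&2
```

The pause here only spaces out the checks. The exit condition is still the rendered grid changing, which is the property the built-in wait is described as providing.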

Here is a complete vim edit with no sleep:

    agent-tui spawn -- vim todo.txt
    agent-tui wait --ref '@vim.buffer'              # 1. wait for the buffer to exist
    agent-tui press i                               # 2. enter insert mode
    agent-tui wait --ref '@vim.mode[value=insert]'  # 3. wait for the mode to flip
    agent-tui type 'review the draft'               # 4. type
    agent-tui press ''                              # 5. leave insert
    agent-tui wait --ref '@vim.mode[value=normal]'  # 6. wait for the mode to flip back
    agent-tui press ':wq'                           # 7. save and quit

Each command waits for the next observable transition. After press i, the script does not assume Vim is ready for text. It waits until the parsed mode is insert. After the key press that leaves insert mode (step 5), it waits until the mode is normal again.
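The pairing of one input with one wait can be folded into a small helper. This is a sketch under two assumptions: that `press KEY` and `wait --ref SELECTOR` behave as in the script above, and that `wait` exits non-zero when the selector never matches (the article does not state the exit-code contract). `AGENT_TUI` and `press_then_wait` are hypothetical names.

```shell
#!/bin/sh
# Sketch only: press_then_wait is an illustrative helper, not an agent-tui command.
AGENT_TUI="${AGENT_TUI:-agent-tui}"

# press_then_wait KEY SELECTOR
# Send one key, then block until SELECTOR matches. Report and fail otherwise.
press_then_wait() {
  "$AGENT_TUI" press "$1" || return 1
  "$AGENT_TUI" wait --ref "$2" || {
    # Assumption: a failed or timed-out wait returns a non-zero status.
    echo "no transition to $2 after pressing $1" >&2
    return 1
  }
}

# Intended use (not executed here):
#   press_then_wait i '@vim.mode[value=insert]'
```

Keeping the wait next to the key press makes it harder to add an input step without naming the state it is supposed to produce.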

Refs avoid a common false positive. A regex can match the literal word insert in the buffer. A wait on @vim.mode[value=insert] watches Vim's parsed mode field. It is not looking at arbitrary screen text.
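The false positive is easy to reproduce without a terminal. In this sketch the screen text is a made-up example of a Vim buffer in normal mode, and the mode node copies the shape of the @vim.mode output shown earlier. The variable names are illustrative.

```shell
#!/bin/sh
# Hypothetical rendered screen: Vim in normal mode, but the buffer mentions "insert".
screen='1. insert the SD card
2. power on the board
~
notes.md                      2,1   All'

# A text regex fires even though Vim is not in insert mode.
if printf '%s\n' "$screen" | grep -q 'insert'; then
  regex_says=insert
else
  regex_says=other
fi

# The parsed mode field, in the shape of the @vim.mode node shown earlier.
mode_node='{"durable":true,"ref":"@vim.mode","role":"mode","value":"normal"}'
case $mode_node in
  *'"value":"insert"'*) ref_says=insert ;;
  *) ref_says=other ;;
esac

echo "regex: $regex_says, ref: $ref_says"   # prints "regex: insert, ref: other"
```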

You can wait for absence too. wait --ref '@vim.cmdline[focused]' --gone blocks until the command prompt closes. For terminal tests, that is usually the difference between "probably done" and "the UI state changed."

Fallback to rendered text

Not every app has an adapter or a useful state signal. htop is a good example: it has no JSON mode, and the useful output is often just the rendered screen.

    agent-tui spawn -- htop
    agent-tui wait --idle 500
    agent-tui --json snapshot --mode text | jq -r .data.text

    0[ 0.0%] 4[ 0.0%] 8[ 0.0%] 12[ 0.0%]
    1[*******100.0%] 5[ 0.0%] 9[ 0.0%] 13[ 0.0%]
    Mem[|||||#*@@@@@@@@@@@@@@@16.0G/124G] Tasks: 29, 117 thr, 0 kthr; 7 running
    Swp[ 0K/0K] Load average: 0.68 0.82 0.69

    PID USER PRI NI VIRT RES SHR S CPU%-MEM% TIME+ Command
    50386 user 20 0 4900 3336 2456 R 160.0 0.0 0:00.02 htop

Use wait --idle 500 when an app has no better signal. It waits for the screen to stop changing, so it is still tied to terminal output instead of a fixed delay after input. Use refs and selectors when the app has structure. Use idle when it does not.
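That rule of thumb can be written down as a tiny dispatcher. This is only a sketch: `settle` and `AGENT_TUI` are hypothetical names, and the two wait invocations are the ones already shown in this article.

```shell
#!/bin/sh
# Sketch only: settle is an illustrative helper, not an agent-tui command.
AGENT_TUI="${AGENT_TUI:-agent-tui}"

# settle [SELECTOR]
# With a selector, wait on named screen state.
# Without one, fall back to waiting for the screen to stop changing.
settle() {
  if [ -n "${1:-}" ]; then
    "$AGENT_TUI" wait --ref "$1"
  else
    "$AGENT_TUI" wait --idle 500
  fi
}

# Intended use (not executed here):
#   settle '@shell.prompt'   # app with an adapter
#   settle                   # htop-style app with no named refs
```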

What about tmux and expect?

The obvious question is whether this is just tmux send-keys plus capture-pane, or a wrapper around expect.

They are useful, but they stop at a lower level.

tmux capture-pane gives you text from the rendered grid. It does not give you roles, named regions, or durable handles. tmux send-keys can write input, but it has no opinion about what screen state should follow.

expect is line-oriented. It is excellent for programs that print prompts and lines. It is a poor fit for a full-screen ncurses app that repaints a cell grid in place. There is no line that says "vim's mode indicator is now insert." The information is on the screen, but it is not in the stream in the form expect wants.
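A contrived example shows the gap between the stream and the screen. The byte stream below is hypothetical: it writes part of a mode indicator, repositions the cursor with an ANSI cursor-position sequence (ESC [ row ; col H), then writes the rest. Stripping the escape with sed is a crude stand-in for a real terminal engine and only works because the cursor lands where the text left off.

```shell
#!/bin/sh
# Hypothetical output stream from a full-screen app on row 24:
# write "-- IN", move the cursor to row 24 column 6, write "SERT --".
esc=$(printf '\033')
stream="-- IN${esc}[24;6HSERT --"

# Matching on the raw stream misses the indicator...
if printf '%s' "$stream" | grep -q -e '-- INSERT --'; then
  stream_match=yes
else
  stream_match=no
fi

# ...although the rendered row reads "-- INSERT --" once the escape is applied.
rendered=$(printf '%s' "$stream" | sed "s/${esc}\[[0-9;]*H//g")

echo "stream match: $stream_match"   # prints "stream match: no"
echo "rendered row: $rendered"       # prints "rendered row: -- INSERT --"
```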

agent-tui sits above the byte stream and below the agent. It reads the terminal state, assigns names to parts of the screen, and lets the caller wait on those names.

Use stdout when stdout is enough

Not every command needs a live terminal. If a program already has a non-interactive mode, the right interface is still stdin, stdout, stderr, and an exit code.

The run subcommand exists for commands that should return data. It gives agents a typed, logged wrapper for those calls while PTY automation stays focused on live screens.

    agent-tui run -- gh api /repos/ConductorOne/agent-tui \
      --jq '{repo: .full_name, lang: .language, default_branch: .default_branch}'

    {"default_branch": "main", "lang": "Rust", "repo": "ConductorOne/agent-tui"}

The result is plain data. It can be piped into jq, fed to another step, or asserted on in a test. run also fronts AI CLIs that expose non-interactive modes (claude -p, codex exec, pi --print, opencode run), so one model's answer can become another step's input without screen-scraping a prompt. ask is a short wrapper over that path.

Live AI CLI sessions have both paths. For a one-shot answer, use the data path. For the human terminal session, use the PTY path. Provider manifests expose the prompt and response as screen regions:

    agent-tui spawn -- claude
    agent-tui wait --ref '@claude.input[focused]'
    agent-tui type --to '@claude.input' 'write a jq filter for this JSON'
    agent-tui press --to '@claude.input' ''
    agent-tui wait --ref '@claude.response[name~=/jq/]'

The adapter names the prompt and response from rendered cells. It does not read a provider transcript API. Knowing that a streaming answer is final across every AI CLI still needs provider-specific events or side channels. The split matters because it keeps the two cases separate: use run when the child is already a data-producing process, and use spawn, snapshot, press, and wait when the child is an interactive screen.

Capture artifacts

A live session also produces files. The daemon records each pane to asciicast-v3 under $XDG_STATE_HOME/agent-tui//.cast. That file works with the asciinema ecosystem: asciinema play can play it, and renderers such as agg can turn it into a GIF for docs.

agent-tui uses the same cast as a test input:

    cast="${XDG_STATE_HOME:-$HOME/.local/state}/agent-tui/default/p1.cast"
    asciinema play "$cast"
    agent-tui replay "$cast" --expect-snapshot expected.snap

replay does not start the original program. It feeds the recorded output bytes into a fresh terminal engine and compares the rendered snapshot. A demo session can become a regression test input.
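A directory of recorded sessions could be turned into a small regression suite along these lines. This is a sketch, not the project's test runner: `replay_all` and `AGENT_TUI` are hypothetical names, the layout of one `.snap` file next to each `.cast` file is an assumption, and so is the idea that `replay` exits non-zero when the rendered snapshot does not match.

```shell
#!/bin/sh
# Sketch only: replay_all is an illustrative helper, not an agent-tui command.
AGENT_TUI="${AGENT_TUI:-agent-tui}"

# replay_all DIR
# Replay every DIR/*.cast against DIR/<name>.snap and fail if any mismatches.
replay_all() {
  failed=0
  for cast in "$1"/*.cast; do
    [ -e "$cast" ] || continue            # DIR contains no casts
    snap="${cast%.cast}.snap"
    # Assumption: replay returns non-zero on a snapshot mismatch.
    if "$AGENT_TUI" replay "$cast" --expect-snapshot "$snap"; then
      echo "ok   $cast"
    else
      echo "FAIL $cast"
      failed=$((failed + 1))
    fi
  done
  [ "$failed" -eq 0 ]
}

# Intended use (not executed here):
#   replay_all tests/casts
```

Because replay feeds recorded bytes into a fresh terminal engine, a loop like this would not need the original programs installed on the machine running it.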

For screenshots, snapshot can render the current grid to PNG. --annotate draws boxes and labels for matching refs; --chrome adds a frame for a README or blog image.

    agent-tui snapshot --mode outline \
      --png vim-mode.png \
      --annotate '@vim.*' \
      --chrome 'vim todo.txt'

Use the cast when time matters. Use the PNG when the current frame matters, optionally with the screen regions named on
