We gave an AI agent eyes. It didn't even use them.
An experiment with the AI agent Goose and Claude Haiku 4.5 showed that giving an agent vision capabilities doesn't guarantee it will use them. The agent succeeded on a tough table-extraction task not by seeing, but by using a layout-aware text tool. The run was recorded with the open AVP standard, and the trajectory suggests that persistence and the right tools can matter more than a pricey model.
May 29, 2026
When we saw how much Opus 4.8 cost, we decided to take a look at what the bottom shelf of the model aisle looked like. What resulted is a sort of recession-proof benchmark: how much hard work can a cheaper model accomplish, provided it's wrapped by a solid agent harness (Goose)?
So we reached for Claude Haiku 4.5 and gave it an extremely annoying PDF page to extract (a page from ParseBench, lifted straight from an arXiv paper). We tested two agent configurations: one that could see, via the pdf-vision MCP server, and one that could only read text, via Goose's built-in pdf_tool.
A few gentle spoilers on what we found, before you read on:
A good harness (Goose) can deliver on a genuinely tough task with an older, cheaper model, provided it has access to the right tools.
Blessing an agent with the gift of vision doesn't actually mean it is going to use it. In this case, the agent didn't use its eyesight at all.
Every step below is recorded with the Agent Voyager Project (AVP), a free, open, platform-agnostic standard for capturing what an agent does. Numbers and quotes are verbatim from the trajectories, on claude-haiku-4-5.
The page that eats parsers
Four tables crammed onto one sheet
This is page 7 of a 2012 econometrics paper, pulled from ParseBench. Four separate tables are crammed onto it. The one that matters is Table 7: two six-by-six correlation matrices stacked on top of each other, triangular, half the cells blank, and values like 0.47 [0.49] where two numbers share one cell.
Table 7. Easy enough to read with your eyes, but brutal to read as text, because the layout carries the meaning and flat text throws the layout away.
The task we gave Goose was easy to state: download the page, rebuild it as an HTML table, do not get it wrong.
Attempt 1
Goose + pdf_tool · 5 turns · $0.05 · 53% failed
First, the obvious move. Goose's built-in PDF reader (pdf_tool, a pdfplumber wrapper) pulls the text off the page. Here is what it handed back.
what pdf_tool returned
… Mar c h FB 4 - 7.309 O 9 - 1.513 69.312 1531.360 7.270 Ta ble 7. C ross c or r e latio n c oe f fic ien ts fo r six C P I ti me se rie s a nd their fir s t diff e r e nc e s. Or i g inal se rie s include 1 24 r e a din g s, and th e ir f irst di ff e re n c e s 123 r e a din g s. F FB SE F V O R PR R SH O F 1 FB 0.99998 1 SE F V 0.99714 0.99671 1 O R PR 0.98356 0.98295 0.98702 1 R SH 0. 97533 0.97478 0.97736 0.99698 1 O 0.97752 0.97661 0.98664 0.95629 0.93924 1 d F d FB d SE F V d O R PR d R SH d O d F 1 d FB 0.994 1 d SE F V 0.47 [ 0. 49] 0.4 8 [ 0. 49] 1 …
Every table on the page, poured into one run-on stream. Aug ust. ORP R. ti me se rie s. No rows, no columns, no way to tell where one table ends and the next begins. Goose even tried to pull images for the structure and got back “No images found in PDF,” so it worked with the text. It rebuilt the matrices, re-read to check, and declared victory.
“All values match perfectly.”
Goose, on claude-haiku-4-5, right before scoring 53%
It scored 53%, a clear fail, and nothing in the run flagged it. At five turns and five cents, it was the fastest run we recorded, on the hardest page on the board. The score on its own just says “fail,” but the trajectory shows something worse: an agent that was confidently wrong and had no idea.
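To see why re-reading the same flat text could not catch the error, here is a minimal Python sketch. It is an illustration, not part of the recorded run: the `flatten` helper and the `misread` layout are assumptions made up for the example, and the values are the first three rows of Table 7.

```python
def flatten(rows):
    """Join every non-empty cell into one stream, roughly what a flat text dump does."""
    return " ".join(cell for row in rows for cell in row if cell)


# First three rows of the upper matrix in Table 7, with blank cells kept as "".
triangular = [
    ["1", "", ""],
    ["0.99998", "1", ""],
    ["0.99714", "0.99671", "1"],
]

# A hypothetical wrong layout that uses the same numbers in the same order.
misread = [
    ["1", "0.99998", "1"],
    ["0.99714", "0.99671", "1"],
]

print(flatten(triangular))                      # 1 0.99998 1 0.99714 0.99671 1
print(flatten(triangular) == flatten(misread))  # True
```

Both layouts collapse to the same stream, so a verification pass that only compares against the flat text has nothing to disagree with.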
Model: claude-haiku-4-5 · Tools: shell, write, edit, computercontroller web_scrape, computercontroller pdf_tool
The whole voyage, recorded: a quick, smooth, doomed five turns.
The full commission prompt:
Download the PDF page at {input}, read it, rebuild it as a single HTML <table>, then re-read the original and verify your table matches before returning. Output only the HTML.
Attempt 2
Goose + pdf-vision · 24 turns · $0.33 · 100% passed
Same model, same page. This time we swapped the built-in reader for pdf-vision, an MCP server whose headline trick is rendering the page to an image so the model can look at it. We even told Goose, in the prompt, to trust the picture over the text.
The eyes did not work. The image kept coming back empty, so Goose tried to see the page another way, and then kept trying as each new approach failed.
“Now let me render the page as an image to see the table structure.”
get_page_image → came back empty
“Let me get the page image with file output so I can view it.”
get_page_image → saved a PNG it still can't see
“Now let me view the image using the computer controller.”
open page.png → nothing
“Let me use computer controller to view the image.”
cat page.png | base64 → a wall of base64
“Let me examine the PDF more carefully to identify which page contains the table.”
get_pdf_info → “the PDF has 1 page”
“Each Y position represents a row. Let me build the complete tables.”
reads the text's bounding-box coordinates instead
“Let me take a completely different approach.”
get_page_text format=markdown
“Perfect! The markdown format shows the tables clearly.”
builds the table, returns it
That was eight attempts to look at a picture that was never going to load, one of which piped a PNG through base64 in the terminal in the hope of reading it by hand. When the eyes failed, Goose fell back to the same tool's other trick, layout-aware text, exporting the page as markdown with the structure the flat reader had destroyed.
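The bounding-box detour ("Each Y position represents a row") can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not what Goose ran: the `x0`, `top`, and `text` keys mirror the field names pdfplumber's `extract_words` uses, the shape of pdf-vision's coordinate output is not known to us, and the coordinates and the `y_tolerance` default are invented for the example.

```python
def group_words_into_rows(words, y_tolerance=2.0):
    """Cluster words whose top edges are within y_tolerance, then order each row by x."""
    rows = []
    for word in sorted(words, key=lambda w: (w["top"], w["x0"])):
        if rows and abs(word["top"] - rows[-1]["top"]) <= y_tolerance:
            rows[-1]["words"].append(word)
        else:
            # Far enough below the previous row: start a new one.
            rows.append({"top": word["top"], "words": [word]})
    return [
        [w["text"] for w in sorted(row["words"], key=lambda w: w["x0"])]
        for row in rows
    ]


# Hypothetical word boxes for the first two rows of the upper matrix.
words = [
    {"text": "FB", "x0": 50, "top": 120.0},
    {"text": "0.99998", "x0": 100, "top": 120.4},
    {"text": "1", "x0": 160, "top": 119.8},
    {"text": "F", "x0": 50, "top": 100.0},
    {"text": "1", "x0": 100, "top": 100.2},
]

print(group_words_into_rows(words))
# [['F', '1'], ['FB', '0.99998', '1']]
```

Grouping by Y only recovers rows. Placing each value in the right column of a triangular matrix would also need the X positions of the header cells, which is the structure the markdown export supplied directly.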
what pdf-vision returned (markdown)
Table 7. Cross correlation coefficients for six CPI time series and their first differences.
||F|FB|SEFV|ORPR|RSH|O|
|---|---|---|---|---|---|---|
|F|1||||||
|FB|0.99998|1|||||
|SEFV|0.99714|0.99671|1||||
|ORPR|0.98356|0.98295|0.98702|1|||
|RSH|0.97533|0.97478|0.97736|0.99698|1||
|O|0.97752|0.97661|0.98664|0.95629|0.93924|1|

||dF|dFB|dSEFV|dORPR|dRSH|dO|
|---|---|---|---|---|---|---|
|dF|1||||||
|dFB|0.994|1|||||
|dSEFV|0.47 [0.49]|0.48 [0.49]|1||||
|dORPR|0.12 [0.26]|0.12 [0.26]|0.31 [0.35]|1|||
|dRSH|0.13 [0.30]|0.12 [0.28]|0.10 [0.29]|0.31 [0.37]|1||
|dO|-0.18 [0.30]|-0.18 [0.28]|0.06 [0.29]|0.002 [-0.21]|0.04 [-0.29]|1|
That export had real rows and columns, the two matrices kept separate, and the bracketed cells intact. From there it was straightforward, and Goose finished the table.
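"Straightforward" here means something like the following Python sketch, which turns a pipe table into an HTML table while keeping blank cells and bracketed values. It is an assumption about the general shape of the step, not the code Goose wrote; the function names are ours, and treating the first cell of each row as a header cell is a design choice.

```python
from html import escape


def parse_pipe_row(line):
    """Split one markdown table row into cells, keeping empty cells."""
    line = line.strip()
    # Remove exactly one outer pipe on each side; str.strip("|") would
    # also eat the pipes that mark blank cells at the row's edges.
    if line.startswith("|"):
        line = line[1:]
    if line.endswith("|"):
        line = line[:-1]
    return [cell.strip() for cell in line.split("|")]


def markdown_table_to_html(markdown):
    lines = [ln for ln in markdown.strip().splitlines() if ln.strip()]
    header = parse_pipe_row(lines[0])
    body = [parse_pipe_row(ln) for ln in lines[2:]]  # lines[1] is the |---| rule
    out = ["<table>"]
    out.append("<tr>" + "".join(f"<th>{escape(c)}</th>" for c in header) + "</tr>")
    for row in body:
        label = f"<th>{escape(row[0])}</th>"
        data = "".join(f"<td>{escape(c)}</td>" for c in row[1:])
        out.append("<tr>" + label + data + "</tr>")
    out.append("</table>")
    return "\n".join(out)


# The first rows of the lower matrix, copied from the export above.
sample = """
||dF|dFB|dSEFV|dORPR|dRSH|dO|
|---|---|---|---|---|---|---|
|dF|1||||||
|dFB|0.994|1|||||
|dSEFV|0.47 [0.49]|0.48 [0.49]|1||||
"""

print(markdown_table_to_html(sample))
```

The one detail that matters is in `parse_pipe_row`: the empty strings between pipes are the blank half of the triangle, so they have to survive as empty cells rather than being dropped.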
Goose scored 100% and passed, on the one page on the board that plain text couldn't touch.
goose-vision, claude-haiku-4-5
Model: claude-haiku-4-5 · Tools: shell, write, edit · MCP: pdf-vision
Twenty-four turns of the same run, recorded end to end. Watch how much longer this voyage is than the last one.
The full commission prompt:
Reproduce the table from a PDF page as a single HTML <table>.
- Download it: curl -sL '{input}' -o page.pdf
- Render the page to an image with the pdf-vision get_page_image tool and LOOK at it. The image is the ground truth for the 2D layout: how many columns there are, which cells are merged or span rows, section-header rows, and which values visually share one cell (e.g. multiple holders inside a single cell). Use get_page_text only to copy exact text; trust the image for structure.
- Rebuild as ONE HTML <table> that matches what you see: one <tr> per visual row, <th> for header cells, <td> for data, colspan/rowspan for merged cells. Keep column order and exact cell text. Do NOT split a value into extra columns or merge rows that are visually separate.
- Verify against the image: the header column count and each row's cell count must match the page. Fix and redo if not. Do not submit a table you can see is wrong.
- Output only the final HTML <table>.
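The verify step in that prompt (header column count and each row's cell count must match) is the kind of check that can be done mechanically. Below is a minimal Python sketch using only the standard library. It is our illustration, not something the agent ran; the class and function names are hypothetical, it counts `colspan` but ignores `rowspan`, and it assumes the expected column count is already known.

```python
from html.parser import HTMLParser


class RowWidthCounter(HTMLParser):
    """Collect the number of columns each <tr> occupies, honoring colspan."""

    def __init__(self):
        super().__init__()
        self.row_widths = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row_widths.append(0)
        elif tag in ("td", "th") and self.row_widths:
            colspan = dict(attrs).get("colspan") or "1"
            self.row_widths[-1] += int(colspan)


def row_widths(html):
    counter = RowWidthCounter()
    counter.feed(html)
    return counter.row_widths


def has_consistent_widths(html, expected_columns):
    widths = row_widths(html)
    return bool(widths) and all(w == expected_columns for w in widths)


good = (
    "<table><tr><th></th><th>F</th><th>FB</th></tr>"
    "<tr><th>F</th><td>1</td><td></td></tr></table>"
)
# Same table with the blank cell dropped from the second row.
bad = (
    "<table><tr><th></th><th>F</th><th>FB</th></tr>"
    "<tr><th>F</th><td>1</td></tr></table>"
)

print(row_widths(good), has_consistent_widths(good, 3))  # [3, 3] True
print(row_widths(bad), has_consistent_widths(bad, 3))    # [3, 2] False
```

A check like this only catches structural slips such as a dropped blank cell. It says nothing about whether the right number landed in the right cell.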
What we actually learned
Three things the numbers won't tell you.
The eyes never mattered.
The model never once saw the page. The win came entirely from text, and specifically from the tool that kept the structure intact. pdf-vision is a misleading name for what actually rescued this run.
A cheap model cleared a hard page on grit.
Haiku did not get smarter between the two runs. What changed is that the second time it had a tool worth being stubborn with, and a harness that refused to stop. Twenty-four turns, eight dead ends, one quiet pivot, a perfect answer. That is the harness wrangling a weaker model all the way to the finish.
You only know this because it was recorded.
The dollar figure says vision cost about 7x more. The score says vision won. Neither tells you the agent never used its eyes, or that the real hero was a markdown export it reached for on turn 22. AVP captures every step, every tool call, every line of the agent's own reasoning, in one open format. The gap between “vision won” and “persistence won, here is the exact turn” is the whole reason we record trajectories.
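For anyone checking the "about 7x" figure, the arithmetic from the two run headers is below, as a small Python sketch. The dictionary layout is ours, and the dollar amounts are the rounded values shown above, so the ratios are approximate.

```python
# Headline numbers from the two runs (costs are the rounded figures reported above).
runs = {
    "pdf_tool": {"turns": 5, "cost_usd": 0.05, "score_pct": 53},
    "pdf-vision": {"turns": 24, "cost_usd": 0.33, "score_pct": 100},
}

cost_ratio = runs["pdf-vision"]["cost_usd"] / runs["pdf_tool"]["cost_usd"]
turn_ratio = runs["pdf-vision"]["turns"] / runs["pdf_tool"]["turns"]

print(f"cost ratio: {cost_ratio:.1f}x")  # cost ratio: 6.6x
print(f"turn ratio: {turn_ratio:.1f}x")  # turn ratio: 4.8x
```

Neither ratio says anything about which tool call produced the answer, which is the part only the trajectory can show.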
Cheaper models can do harder things than their price tag suggests, if the harness is good and you can see what it is doing. We are going to keep poking at that. More from the lab soon.