Lessons from Building Evals for Financial AI Agents

This article shares key lessons from three years of building internal evaluations for financial AI agents. The author argues that absolute scoring fails beyond a quality threshold, and relative scoring is more effective. Key insights include using the strongest frontier models as judges, granting them access to raw data, accounting for variance in both agents and judges, and evaluating the agent's reasoning path alongside outcomes. The article also critiques existing financial benchmarks and introduces an internal 'Adjusted Cash Flow' eval.

Source: Hacker News AI. Author: smallwoodal

They say you’re the average of the five people you spend the most time with. If that’s true, we’re all slowly becoming a weighted average of our AI agents.

I definitely feel it, having started saying “canonical” in social settings.

After leaving the hedge fund desk three years ago, I’ve spent most of my waking hours prompting LLMs, testing AI agents, and evaluating stock research. From promising $1,000 tips to GPT-3.5, all the way to working with today’s agents, harnesses, and tools.

It’s taken a toll. But it’s also forced me to go deep on both finance and AI, and to form views on what “good” equity research actually looks like — and how to evaluate it.

The challenge is that most publicly available “finance AI” benchmarks fail at one key thing: capturing nuance. And when it comes to investing, nuance matters.

So I built my own internal evaluations. There’s still a lot of work to do, but these are the core lessons so far:

Absolute scoring fails past a certain quality threshold - the “max out” problem

Use relative scoring to capture nuance and compare agents - but do it right

Use the strongest frontier models as your judges - forget about costs

Give the judges access to raw data - they can’t take things at face value

Variance applies to both contenders and judges - act accordingly

Outcomes matter, but how you get there is also key - evaluate the agent’s path

What’s next? Live earnings coverage - the beginning of truly autonomous research

Why Evaluating Equity Research Is Hard

Deep equity research is fundamentally different from most tasks we evaluate LLMs on.

No single answer

There is no deterministic verification. You cannot just check whether the note is “correct”, because in many cases there is no single correct answer. There are judgment calls, different valid framings, and legitimate disagreements between smart analysts.

One analyst might treat margin pressure as temporary overinvestment. Another might see it as evidence of structural competition. Both can be financially literate.

This is why scoring against a rubric hits a ceiling very quickly. Once an agent is basically competent (it applies the right methodology, does the math correctly, and presents a financially defensible case), absolute scores stop differentiating well. Two reports can both tick every box and still differ in quality.

Judges need competition

What matters is not whether multiple reports collapse the uncertainty into the same answer. It’s whether the research improves your map of the possible outcomes — surfacing scenarios most missed, assigning sensible probability weights, and identifying what can move the odds.

Imagine you’re judging two reports. You read report A, it’s really good, you score it 9/10. Then someone erases your memory, you read report B and also score it 9/10. But when reading them side by side, you realise that B is better than A. Same absolute score, different quality.

LLM judges (which simply means AI models that act as graders for the work of other AIs) face the same problem. When asked to score individual research in isolation, they struggle to grade quality past “good”.

The distinction of great vs good is exactly where equity research lives.

The Problem with Financial Benchmarks

Many public finance benchmarks, especially the ones that scale cleanly, gravitate toward tasks with deterministic answers.

Retrieval ≠ finance

Some are basically retrieval tests in disguise. “What dividend did Company X declare in 2023?” “What was revenue in the latest quarter?”

These are useful tests. They measure whether an agent can find the right fact and cite the right source. But that is very different from measuring investment judgment.

At Primer, we’ve maxed out FinRetrieval and scored 100%. That was a real achievement, and it showed that our agent could retrieve financial facts reliably. But you cannot really observe GPT-5.5 performance versus GPT-5.4 with that kind of benchmark.

Unit-tests ≠ judgment

Another popular category is model-building. A lot of these benchmarks are closer to Excel formula verification: can the agent link cells correctly, build a three-statement model, refresh a model with newly disclosed numbers?

Again, useful. But mostly as a “unit test”. They do not test whether the agent can actually model a company in the investment sense: make reasoned assumptions, understand drivers, and forecast sensibly.

There are also benchmarks around junior investment banking tasks: presentations, decks, market summaries, and similar outputs. Those can be valuable, but they still do not answer the core question: how do you measure nuance inside actual investment research?

Why Internal Benchmarks

For a long time, I did what many people do: I eyeballed outputs. Too much, honestly.

Having domain knowledge can be a curse here. It works. You can read a report for a company you know inside-out, and see whether it surprises you. But it is not scalable, not systematic, and not good enough if you are trying to compare:

Is GPT-5.5 at xhigh reasoning effort better than at high?

Is harness A better than harness B?

Do these tools improve output quality?

By how much?

So I had to build internal benchmarks.

Different evals answer different questions

There are a few different ways to evaluate agents, and they are not interchangeable.

Ground-truth evals are great when the task is to retrieve a number. Rubrics are useful for enforcing minimum standards. Baselines are useful when you have a human reference, or a current “best” output that represents the quality bar. Relative scoring lets you answer the real question: how do these agents compare against each other?

I’ll skip ground-truth and rubric, as we’re all familiar with those.

Baseline - what’s your current “best”?

In my case, the baseline was a research note produced by a fixed pipeline, not an agent: a sequence of tightly scoped prompts to LLMs, some running in parallel, some feeding into each other. The data was always provided in the most token-efficient format, exactly what the LLM needed. I was even producing multiple notes and having a final model consider all those points of view. That minimized variance and shed more light on the range of possible outcomes.

It worked well when end-to-end agents were not yet reliable enough to make open-ended research decisions. The note was consistently “good”. But it was still a controlled workflow, not an autonomous analyst.

That made it a strong baseline — because it represented the current “best”.

Adjusted Cash Flow - can the agent beat your “best”?

One of my first serious internal evals was the ‘Adjusted Cash Flow Note’ evaluation, based on work I used in practice: stripping out accounting noise to understand the true underlying cash generation ability of a business.

The agent has to decide which cash-flow items are recurring, which are accounting artifacts, which adjustments are economically justified, and how much confidence to attach to each.

There is no single right answer. But there are better and worse ways to reason.

The first time our agent won the title

GPT-5.4 was a step up. We all felt it. But how much so?

I decided to test our agent on the adjusted cash flow note for Copart (CPRT), an online auto auction company.

The “rubric” eval gave both our baseline and the agent the same score. Both notes were financially competent. Both understood that reported cash flow needed adjustment. Both produced usable bridges.

But the agent handled operating leases more rigorously than the baseline. Instead of using lease expense as a proxy, it estimated right-of-use asset depreciation and lease interest separately. It also kept a better adjustments log, explained uncertainty more clearly, and reconciled the final bridge more cleanly.

For the first time, an agent had beaten our “baseline”.
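To make the lease point concrete, here is a minimal sketch of splitting a lease cost into its two components. It assumes straight-line depreciation of the right-of-use (ROU) asset and interest charged on the opening lease liability. The function name and every figure are hypothetical; they are not Copart's numbers and not the agent's actual method.

```python
def split_lease_cost(rou_asset, lease_liability, discount_rate, remaining_years):
    """Approximate one year's ROU depreciation and lease interest.

    Assumptions (illustrative only): straight-line depreciation over the
    remaining lease term, and interest on the opening lease liability.
    """
    depreciation = rou_asset / remaining_years   # straight-line assumption
    interest = lease_liability * discount_rate   # interest on opening liability
    return depreciation, interest


# Hypothetical figures, chosen only to show the arithmetic.
depreciation, interest = split_lease_cost(
    rou_asset=100.0,
    lease_liability=110.0,
    discount_rate=0.05,
    remaining_years=8,
)
print(f"depreciation={depreciation:.1f}, interest={interest:.1f}")
```

Separating the two lets a note treat the depreciation piece as an operating cost and the interest piece as a financing cost, rather than lumping both into a single lease expense.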

That is the kind of distinction standard evals struggled to capture. A rubric eval asked: did the agent build a cash-flow note? A baseline eval asked: did the agent reveal something more useful about the company’s true cash generation and the range of plausible outcomes?

Why Relative Scoring Works

This is probably the most important scoring lesson.

GPT-5.4 had beaten our baseline - fantastic. But I still couldn’t answer questions like:

Is xhigh reasoning effort better than high?

Is GPT-5.5 better than GPT-5.4?

Is this extra data source improving quality?

How about this change in the harness?

Once agents got good enough to consistently beat the baseline, that eval also started suffering from the same issue as rubrics: incremental performance was hard to observe.

So I needed something different.

Side-by-side comparison reveals nuance

The best solution I have found is relative comparison.

Put the outputs next to each other. Let the judge see all of them together. Ask it to rank them, score them, and explain the differences.
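A minimal sketch of what one side-by-side judge pass could look like, assuming the judge is given a single prompt containing every output. The function name, the neutral "Report A/B" labels, and the seeded shuffle (so the judge cannot tell which agent wrote which report) are illustrative assumptions, not the setup described in this article.

```python
import random


def build_side_by_side_prompt(outputs, task, seed=0):
    """Return (prompt, label_to_agent) for one side-by-side judge pass.

    outputs: dict mapping a (hypothetical) agent name to its report text.
    """
    agents = sorted(outputs)
    # Assumption: shuffle and relabel so the judge sees neutral labels only.
    random.Random(seed).shuffle(agents)
    labels = [f"Report {chr(ord('A') + i)}" for i in range(len(agents))]
    label_to_agent = dict(zip(labels, agents))

    sections = [
        f"=== {label} ===\n{outputs[agent]}"
        for label, agent in label_to_agent.items()
    ]
    prompt = (
        f"Task: {task}\n\n"
        + "\n\n".join(sections)
        + "\n\nRead all reports before judging. Rank them from best to worst, "
        "give each a score, and explain what each one saw that the others missed."
    )
    return prompt, label_to_agent


prompt, label_to_agent = build_side_by_side_prompt(
    {"agent_a": "Note text from agent A", "agent_b": "Note text from agent B"},
    task="Adjusted cash flow note",
)
```

The returned mapping is kept outside the prompt so rankings can be translated back to agent names after the judge responds.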

That is also how investors actually evaluate research. A portfolio manager does not read one note in a vacuum, assign it an absolute score, and move on. They speak to multiple analysts, they compare arguments. They notice what one analyst saw that another missed. They revise their view as new evidence or better reasoning appears.

“I liked that first report… but now that I’ve read the second one, I realize the first one missed this key issue.”

That is the evaluation dynamic we want.

Robinhood: same score, different rank

A Robinhood (HOOD) forecast eval is a good example. Robinhood is a brokerage and fintech platform. Two agents produced similar model and forecast notes that scored the same in absolute terms. But the relative judge still preferred one output because it used alternative data to corroborate near-term trends.

The agent used X/Twitter to check that Robinhood’s product expansion, especially event contracts and crypto/product launches, was getting positive real-time attention.

It would have been impossible to know in isolation that one model was missing “X”. But when put side by side, it became apparent.

Control groups matter

Relative scoring is not perfect. The score is only valid inside a specific comparison set. Rankings can move if the control group changes. A report that looks strong against weak outputs may look average against stronger ones.

We deal with this by keeping stable control outputs, running multiple comparisons, and looking for consistent separation across judge passes rather than treating one ranking as gospel. What matters is whether an agent keeps winning across runs, judges, companies, and sets.

That signal is much stronger than an isolated 8.7/10.
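One way to look for consistent separation is to aggregate rankings across many comparisons instead of reading any single one. The sketch below is an assumption about how that could be tallied: the record layout, the function names, and the 0.8 threshold are all hypothetical, and the rankings are made-up illustrative data, not results from these evals.

```python
from collections import defaultdict


def win_rates(comparisons):
    """Share of comparisons in which each agent was ranked first."""
    wins = defaultdict(int)
    appearances = defaultdict(int)
    for comparison in comparisons:
        ranking = comparison["ranking"]  # agent names, best first
        for agent in ranking:
            appearances[agent] += 1
        wins[ranking[0]] += 1
    return {agent: wins[agent] / appearances[agent] for agent in appearances}


def separates_consistently(comparisons, agent, rival, threshold=0.8):
    """True if `agent` outranks `rival` in at least `threshold` of shared runs."""
    shared = [
        c["ranking"]
        for c in comparisons
        if agent in c["ranking"] and rival in c["ranking"]
    ]
    if not shared:
        return False
    ahead = sum(r.index(agent) < r.index(rival) for r in shared)
    return ahead / len(shared) >= threshold  # threshold is an arbitrary choice


# Made-up rankings across two companies and two judge passes each.
comparisons = [
    {"company": "company_1", "judge_pass": 1, "ranking": ["new_agent", "control", "old_agent"]},
    {"company": "company_1", "judge_pass": 2, "ranking": ["new_agent", "old_agent", "control"]},
    {"company": "company_2", "judge_pass": 1, "ranking": ["control", "new_agent", "old_agent"]},
    {"company": "company_2", "judge_pass": 2, "ranking": ["new_agent", "control", "old_agent"]},
]

rates = win_rates(comparisons)
beats_old = separates_consistently(comparisons, "new_agent", "old_agent")
beats_control = separates_consistently(comparisons, "new_agent", "control")
```

In this toy data the new agent outranks the old one in every comparison but outranks the control in only three of four, so under the assumed threshold only the first separation would count as consistent.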

Use the Strongest Judges

This one is key.

For serious research workflows (Adjusted Cash Flow Notes, modeling, forecasting) we use the strongest judge model available. Today, that is GPT-5.5 Pro or GPT-5.5 at xhigh reasoning effort, but the principle matters more than the model name.

If the task requires expert judgment, the judge has to be capable of expert judgment. It needs to spot subtle analytical weaknesses, distinguish insight from verbosity, and recognize when a conclusion is financially defensible but not actually useful.

Give the Judge Access to Data

If the agent used source documents, market data, X, Polymarket, or alternative data, the judge needs to be able to check the evidence, at least enough to verify the claims it is being asked to evaluate.

With access to the underlying data, the judge can ask: “Is this actually right? Did it ignore something important? Did it overstate the conclusion?”

That is similar to how a fund manager reads research. If a claim matters, you verify it. You check the source. You run extra checks until you are comfortable with the conclusion.

Without data access, the judge risks committing the sin any fund manager I ever met would punish hardest: taking things at face value.

Variance Applies to Both Agent and Judge

LLMs are stochastic. One run is not enough. Wrong conclusions are easy to reach.

We run every agent configuration at least three times, ofte

[truncated for AI cost control]