2026-06-26 07:06 UTC · In-site rewrite · 5 min read · Updated: 2026-06-26 07:14 UTC

A curated, non-BS library of the best resources for evaluating agents

A curated, annotated library of more than 443 resources for AI agent evaluation, including papers, blog posts, talks, and tools, maintained by BenchFlow with a focus on quality and verification. The library was built via a recursive citation crawl, practitioner discovery, talk transcription, and adversarial audits; every entry is verified and explained.

Source: Hacker News AI · Author: xdotli


A curated, opinionated, non-BS library of the best resources for building and evaluating AI agents — papers, blog posts, talks, courses, tools, and benchmarks.

Maintained by BenchFlow

Most "awesome" lists are link dumps. This one is annotated and verified: every entry says what it is and why it belongs, URLs are checked, quotes are verbatim, and dead/abandoned tools are pruned (not silently listed). It was assembled by:

a depth-4 recursive citation crawl (11.6k papers, ranked by in-degree) to surface the academic canon,

targeted practitioner-web discovery for the industry sources citation graphs miss (Eugene Yan, Han-Chung Lee, Hamel Husain, Shreya Shankar, Nathan Lambert, …),

47 talks & podcasts transcribed and deep-noted (verbatim + timestamps), and

per-section gap audits with adversarial verification.
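As a rough illustration of the first step in the list above, a depth-limited citation crawl ranked by in-degree might look like the sketch below. The actual pipeline is not shown in this README: `toy_graph`, `crawl_citations`, and `rank_by_in_degree` are hypothetical names, and the only assumption is a mapping from each paper to the papers it cites.

```python
from collections import Counter, deque

def crawl_citations(seeds, cites, max_depth=4):
    """Breadth-first crawl over outgoing citations, at most max_depth hops from a seed."""
    depth = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    edges = set()
    while queue:
        paper = queue.popleft()
        if depth[paper] >= max_depth:
            continue  # papers at the depth limit are recorded but not expanded
        for cited in cites.get(paper, []):
            edges.add((paper, cited))
            if cited not in depth:
                depth[cited] = depth[paper] + 1
                queue.append(cited)
    return set(depth), edges

def rank_by_in_degree(edges):
    """Rank papers by how many crawled papers cite them (ties broken by name)."""
    in_degree = Counter(cited for _, cited in edges)
    return sorted(in_degree.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical toy citation graph: paper -> papers it cites.
toy_graph = {"A": ["B", "C"], "B": ["C"], "C": ["D"], "D": ["E"], "E": ["F"]}

papers, edges = crawl_citations(["A"], toy_graph, max_depth=4)
ranking = rank_by_in_degree(edges)
print(ranking[:3])
```

Ranking by in-degree favors papers that many other crawled papers cite, which is one plausible reading of "surfacing the canon"; it says nothing about recency or quality on its own.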

443+ curated links · 146 deep reading notes (see notes/). Markers: 🆕 = released/updated 2025–2026 · ⚠️ = caveat. Contributions welcome — see CONTRIBUTING.

📘 Playbook: PATTERNS.md — real, runnable code + worked examples for LLM-as-judge (aligned to humans), pass@k/pass^k, error analysis, trajectory & world-state grading, CI gating, verifiable rewards, and more.

Contents

📘 Playbook — real code & worked examples (PATTERNS.md)

⭐ Must-read starter set (read these first)

1 · Why we need evals

2 · "If you can eval it, you have built it" — eval ⇄ capability ⇄ RL environment

3 · The model / harness / skill decomposition

4 · Observability & the output / eval space (the surfaces you can grade)

5 · Evaluation infrastructure (the eval stack: datasets, scorers, online/offline, tracing, CI)

6 · Benchmark vs. eval (and benchmark integrity: contamination, saturation, label errors, leaderboard gaming)

7 · Evals & RL environments (verifiers, reward design, difficulty calibration, lifecycle)

8 · LLM-as-judge & verifiers (alignment, biases, verifiable vs judgeable)

9 · Agent-specific evaluation (trajectories, tool use, multi-turn, world state, multi-agent, localization)

10 · Safety / adversarial evaluation (prompt injection, jailbreaks, action-authorization, benchmark auditing)

🎙 Talks, podcasts & slides (transcribed + noted)

💬 Eval insights inside general agent posts

🔎 Scan additions

Companies & landscape (eval / RL-environment market)

Notes on provenance & gaps

Deep notes

Contributing

License

⭐ Must-read starter set (read these first)

The Second Half — Shunyu Yao — https://ysymyth.github.io/The-Second-Half/ · blog — "Evaluation becomes more important than training." The field-level why.

An LLM-as-Judge Won't Save the Product, Fixing Your Process Will — Eugene Yan — https://eugeneyan.com/writing/eval-process/ · blog — Process over tooling; evals as the scientific method.

Hidden Technical Debt: Agent Evaluation Infrastructure — Han-Chung Lee — https://leehanchung.github.io/blogs/2026/06/13/hidden-technical-debt-agent-evaluation-infra/ · blog — Control/data plane, the five eval surfaces, state deltas. "Chat eval was a spreadsheet; agent eval is a system."

LLM Evals FAQ — Hamel Husain & Shreya Shankar — https://hamel.dev/blog/posts/evals-faq/ · blog — The densest operational Q&A: error analysis, binary judgments, the benevolent-dictator labeler.

Asymmetry of Verification and Verifier's Law — Jason Wei — https://www.jasonwei.net/blog/asymmetry-of-verification-and-verifiers-law · blog — "Ability to verify == ability to create an RL environment."

Demystifying Evals for AI Agents — Anthropic — https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents · blog — Best primary on agent-specific evals: task design, outcome vs trajectory, isolated trials, pass@k vs pass^k.

How to Build Good Language Modeling Benchmarks — Ofir Press — https://ofir.io/How-to-Build-Good-Language-Modeling-Benchmarks/ · blog — Natural / auto-evaluatable / challenging; the "-200%" difficulty target; ~1-yr saturation.

AI Agents That Matter — Kapoor, Stroebl, Siegel, Nadgir, Narayanan — https://arxiv.org/abs/2407.01502 · paper — Cost as a first-class metric; model-dev vs app-dev; missing holdouts breed overfitting.

Building on Evaluation Quicksand — Nathan Lambert — https://www.interconnects.ai/p/building-on-evaluation-quicksand · blog — LLM eval has no ground truth; contamination; eval↔training coupling.

Who Validates the Validators? (EvalGen) — Shankar, Zamfirescu-Pereira, Hartmann, Parameswaran, Arawjo (UIST '24) — https://arxiv.org/abs/2404.12272 · paper — "Criteria drift": you can't write the rubric before you grade.

Benches 2026 — "LLM benchmarks in the era of agents" — Florian Brand (Prime Intellect) — https://florianbrand.com/posts/benches-2026 · blog + 61-slide talk — The sharpest current read on why benchmarks break in the agent era: the "evals are dead, just measure vibes" backlash, how every layer of the eval-running stack (prompt · sampling temp · grader · harness) swings the score, and that benchmark ground truth is frequently wrong.

A Shared Playbook for Trustworthy Third-Party Evaluations — OpenAI — https://openai.com/index/trustworthy-third-party-evaluations-foundations/ · blog (Safety, May 2026) — What makes independent evals of frontier-model safeguards & capabilities trustworthy: harness selection, the validity hazards that distort results, and the standards third-party evaluators need.
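The pass@k vs pass^k distinction named in the Anthropic entry above can be made concrete with a small sketch. It assumes n independent trials per task with c successes and uses the standard combinatorial estimators: pass@k as the chance that at least one of k sampled trials succeeds, pass^k as the chance that all k succeed. The function names are illustrative and are not taken from PATTERNS.md.

```python
from math import comb

def _validate(n: int, c: int, k: int) -> None:
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k trials drawn from n (c successes) passes."""
    _validate(n, c, k)
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_hat_k(n: int, c: int, k: int) -> float:
    """Probability that all k trials drawn from n (c successes) pass."""
    _validate(n, c, k)
    return comb(c, k) / comb(n, k)

# Hypothetical task: 10 trials, 5 of which succeeded.
for k in (1, 2, 4):
    print(k, round(pass_at_k(10, 5, k), 3), round(pass_hat_k(10, 5, k), 3))
```

At k = 1 the two metrics coincide with the raw success rate. As k grows, pass@k rises toward 1 while pass^k falls toward 0, which is why an agent that "usually works" can look strong on one and unreliable on the other.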

1 · Why we need evals

The Second Half — Shunyu Yao — https://ysymyth.github.io/The-Second-Half/ · blog — The bottleneck shifts from solving problems to defining and evaluating them. (also T2, T7)

An LLM-as-Judge Won't Save the Product, Fixing Your Process Will — Eugene Yan — https://eugeneyan.com/writing/eval-process/ · blog — "Buying or building another evaluation tool won't save the product." Evals = the scientific method in disguise.

Your AI Product Needs Evals — Hamel Husain — https://hamel.dev/blog/posts/evals/ · blog — The canonical "you need evals"; remove all friction from looking at your data; don't rely on generic frameworks.

A Field Guide to Rapidly Improving AI Products — Hamel Husain — https://hamel.dev/blog/posts/field-guide/ · blog — "Error analysis is consistently the highest-ROI activity." The metric for an AI roadmap is experiments run.

In Defense of AI Evals, for Everyone — Shreya Shankar — https://www.sh-reya.com/blog/in-defense-ai-evals/ · blog — Rebuts the anti-eval backlash; evals = the systematic measurement of application quality.

What We Learned from a Year of Building with LLMs — Yan, Bischof, Frye, Husain, Liu, Shankar — https://applied-llms.org/ (Part II: https://www.oreilly.com/radar/what-we-learned-from-a-year-of-building-with-llms-part-ii/) · blog — The "intern test," genchi genbutsu, turning vibe-checks into assertions.

Big Tech's LLM Evals Are Just Marketing — Nathan Lambert — https://www.interconnects.ai/p/evals-are-marketing · blog — Why frontier-lab leaderboard numbers are marketing, not science.

AI Engineering pitfalls — Chip Huyen — https://huyenchip.com/2025/01/16/ai-engineering-pitfalls.html · blog — Common eval/AI-engineering mistakes from the AI Engineering author. (also T6)

Evals Are NOT All You Need — Aishwarya Naresh Reganti & Kiriti Badam (O'Reilly Radar) — https://www.oreilly.com/radar/evals-are-not-all-you-need/ · blog — The essential nuance piece: automated graders alone don't save you; you need a continuous-improvement flywheel of offline tests + production monitoring + real-user iteration. Pairs with Shreya's 'In Defense' to complete the backlash debate. 🆕

Why AI evals are the hottest new skill for product builders — Hamel Husain & Shreya Shankar with Lenny Rachitsky (Lenny's Podcast/Newsletter) — https://www.lennysnewsletter.com/p/why-ai-evals-are-the-hottest-new-skill · talk — The accessible 'why evals matter' on-ramp (live walkthrough of error analysis, open/axial coding) that mainstreamed evals to PMs in 2025; the apartment-leasing-bot anecdote is the canonical 'you can't vibe-check' story. 🆕

How evals drive the next chapter in AI for businesses — OpenAI — https://openai.com/index/evals-drive-next-chapter-of-ai/ · blog — Frontier-lab framing of evals as turning fuzzy business goals into specs and measurable ROI; useful counterweight to Lambert's 'evals are marketing' and grounds the 'why' for enterprise readers. 🆕 ⚠(unverified URL)

Beyond vibe checks: A PM's complete guide to evals — Aman Khan (Arize) with Lenny Rachitsky — https://www.lennysnewsletter.com/p/beyond-vibe-checks-a-pms-complete · blog — The widely-shared PM-oriented argument for moving past 'looked good to me' vibe checks to systematic evals; one of the pieces that made evals a mainstream product skill in 2025. 🆕

A pragmatic guide to LLM evals for devs — Gergely Orosz & Hamel Husain (The Pragmatic Engineer) — https://newsletter.pragmaticengineer.com/p/evals · newsletter — Reaches the broad engineering audience with the core 'why': LLM non-determinism breaks traditional testing, so you need evals. High-distribution motivation piece co-written by Hamel. 🆕

Predicting model behavior before release by simulating deployment (Deployment Simulation) — OpenAI — https://openai.com/index/deployment-simulation/ · blog — Concrete 2026 evidence for why fixed/static evals fail: models recognize when they're being tested and game test suites; replaying ~1.3M real conversations surfaced reward-hacking no fixed eval caught. Strong 'why evals must evolve' argument. 🆕 ⚠(unverified URL)

evals are surprisingly often all you need — Greg Brockman (OpenAI) — https://x.com/gdb/status/1733553161884127435 · blog — The canonical one-liner ('evals are the new unit test') that anchors the whole 'why evals' thesis; frequently cited founding quote for the movement. Short but load-bearing.

Must-reads: Yao · Yan (eval-process) · Hamel (field-guide)
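As one hedged illustration of "turning vibe-checks into assertions" from the list above, the sketch below encodes a few things a reviewer might otherwise eyeball as plain checks over a model reply. The sample replies, the specific checks, and all names are hypothetical; none of them comes from the linked posts.

```python
import re

def check_reply(reply: str, max_words: int = 120) -> dict:
    """Run simple, deterministic checks on one model reply."""
    return {
        "non_empty": bool(reply.strip()),
        "within_length": len(reply.split()) <= max_words,
        # Unfilled template slots such as [CUSTOMER_NAME] should never reach a user.
        "no_placeholder": re.search(r"\[[A-Z_]+\]", reply) is None,
        "no_ai_disclaimer": "as an ai language model" not in reply.lower(),
    }

# Hypothetical replies from a leasing assistant.
good = "Thanks for reaching out. The two-bedroom unit is available from July 1."
bad = "Hi [CUSTOMER_NAME], as an AI language model I cannot check availability."

failed = [name for name, ok in check_reply(bad).items() if not ok]
print(failed)
assert all(check_reply(good).values())
```

Checks like these only catch failure modes someone has already noticed, so they complement error analysis on real traces and do not replace it.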

2 · "If you can eval it, you have built it" — eval ⇄ capability ⇄ RL environment

Asymmetry of Verification and Verifier's Law — Jason Wei — https://www.jasonwei.net/blog/asymmetry-of-verification-and-verifiers-law · blog — Trainability tracks verifiability; verifying = creating an RL environment.

A Taxonomy of RL Environments for LLM Agents — Han-Chung Lee — https://leehanchung.github.io/blogs/2026/03/21/rl-environments-for-llm-agents/ · blog — A benchmark is a frozen RL environment; the E = {T,H,V,S,C} decomposition; "verifiable beats judgeable."

The Life Cycle of an RL Environment — Kanav Garg (Core Automation; ex-DeepMind) — talk; summary at https://muratbuffalo.blogspot.com/2026/06/acm-cais-conference-on-ai-and-agentic.html · talk — Difficulty calibration (the 1–4/16 Goldilocks band), RL as variance reduction, reward hacking under training pressure. (local notes: research/notes/kanav-garg-rl-environment-lifecycle.md)

Welcome to the Era of Experience — David Silver & Richard Sutton — https://storage.googleapis.com/deepmind-media/Era-of-Experience%20/The%20Era%20of%20Experience%20Paper.pdf · paper — Human-data value approaching its ceiling; the frontier is agents learning from experience / synthetic environments.

RLHF Book, Ch. 16 — Evaluation — Nathan Lambert — https://rlhfbook.com/c/16-evaluation · book — Evaluation as a reflection of training goals; prompt-format sensitivity (60%→~0%).

What Comes Next with Reinforcement Learning — Nathan Lambert — https://www.interconnects.ai/p/what-comes-next-with-reinforcement · blog — Long-horizon credit assignment; where RL is and isn't ready.

verifiers — Prim

[truncated for AI cost control]