Time to REFLECT: Can We Trust LLM Judges for Evidence-based Research Agents?
As deep research agents increasingly automate complex information-seeking tasks, reliable evaluation becomes critical. LLM-as-judge is used to assess these agents, but its reliability is poorly understood. The REFLECT benchmark introduces a fine-grained meta-evaluation and finds that current LLM judges are unreliable: even the best-performing models achieve overall accuracies below 55% across reasoning, tool-use, and report-quality failures, and they perform especially poorly on evidence verification. The study offers actionable guidance for building more reliable evaluation pipelines.
Article intelligence
Key points
- LLM judges show systematic limitations in evaluating deep research agents: even the best-performing models reach overall accuracies below 55%
- REFLECT benchmark generates fine-grained failure instances via controlled interventions on agent traces
- Current LLM judges perform particularly poorly on evidence verification tasks
- The study provides actionable guidance for building more reliable evaluation pipelines
Why it matters
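LLM judges are positioned as a scalable way to supervise deep research agents, so the judges themselves must be evaluated before they are deployed. If even the best-performing judges stay below 55% overall accuracy at detecting reasoning, tool-use, and report-quality failures, their verdicts on agent quality cannot be taken at face value.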
Technical impact
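The findings bear most directly on the choice of judge model and the design of evaluation benchmarks for research agents. The paper also reports tradeoffs between judge cost and reliability, which affects how much inference budget an evaluation pipeline should spend.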
[Submitted on 18 May 2026]
Title: Time to REFLECT: Can We Trust LLM Judges for Evidence-based Research Agents?
Authors: Leyao Wang and 7 other authors
Abstract: Deep research agents increasingly automate complex information-seeking tasks, producing evidence-grounded reports via multi-step reasoning, tool use, and synthesis. Their growing role demands scalable, reliable evaluation, positioning LLM-as-judge as a supervision paradigm for assessing factual accuracy, evidence use, and reasoning quality. Yet the reliability of these judges for deep research agents remains poorly understood, posing a critical meta-evaluation problem: before deploying LLM judges to supervise research agents, we must first evaluate the judges themselves. Existing meta-evaluations fall short in two ways: (1) reliance on coarse, subjective human-preference agreement; (2) focus on instruction-following or verifiable tasks, leaving open-ended agent executions unexplored. To address these gaps, we introduce REFLECT (REliable Fine-grained LLM judge Evaluation via Controlled inTervention), a meta-evaluation benchmark targeting fine-grained failure detection in agentic environments. REFLECT defines a detailed taxonomy of process- and outcome-level failure modes, instantiated by performing controlled and localized interventions on quality-screened agent execution traces. This yields verifiable, comprehensive, and fine-grained instances for validating the judge models. Our experiments show that current LLM judges remain unreliable: even the best-performing models achieve overall accuracies below 55% across reasoning, tool-use, and report-quality failures, with especially poor performance on evidence verification. Together, our taxonomy and findings expose systematic judge limitations, reveal tradeoffs in cost and reliability, and offer actionable guidance for building more reliable evaluation pipelines for deep research agents.
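To make the idea of a controlled, localized intervention concrete, here is a minimal Python sketch of how such a meta-evaluation could be organized. It is not the REFLECT implementation: the `Step` and `Instance` types, the category labels (`tool_use`, `evidence_verification`, `none`), the `inject_failure` helper, and the keyword-matching `toy_judge` are all hypothetical names chosen for illustration. The sketch corrupts exactly one step of a clean trace, records what was changed as the ground-truth label, and then scores a judge per failure category.

```python
from collections import defaultdict
from dataclasses import dataclass, replace
from typing import Callable


@dataclass(frozen=True)
class Step:
    kind: str      # hypothetical step types: "reasoning", "tool_call", "report"
    content: str


@dataclass(frozen=True)
class Instance:
    steps: tuple[Step, ...]
    failure_category: str  # "none" marks an untouched trace
    failure_index: int     # -1 when no failure was injected


def inject_failure(steps, index, category, corrupted):
    """Corrupt exactly one step and keep the label of what was changed."""
    modified = list(steps)
    modified[index] = replace(modified[index], content=corrupted)
    return Instance(tuple(modified), category, index)


# A judge maps a trace to (predicted category, predicted step index).
Judge = Callable[[tuple[Step, ...]], tuple[str, int]]


def accuracy_by_category(instances, judge: Judge):
    """Fraction of instances per category where the judge matches the label."""
    hits, totals = defaultdict(int), defaultdict(int)
    for inst in instances:
        totals[inst.failure_category] += 1
        if judge(inst.steps) == (inst.failure_category, inst.failure_index):
            hits[inst.failure_category] += 1
    return {cat: hits[cat] / totals[cat] for cat in totals}


def toy_judge(steps):
    """Stand-in for an LLM judge: only notices obviously broken tool calls."""
    for i, step in enumerate(steps):
        if step.kind == "tool_call" and "ERROR" in step.content:
            return ("tool_use", i)
    return ("none", -1)


clean = (
    Step("reasoning", "Plan: look up the 2020 census figure."),
    Step("tool_call", "search('2020 census population')"),
    Step("report", "The population was 331 million [source 1]."),
)

instances = [
    Instance(clean, "none", -1),
    inject_failure(clean, 1, "tool_use", "search('ERROR: malformed query')"),
    inject_failure(clean, 2, "evidence_verification",
                   "The population was 400 million [source 1]."),
]

scores = accuracy_by_category(instances, toy_judge)
print(scores)
# {'none': 1.0, 'tool_use': 1.0, 'evidence_verification': 0.0}
```

Because each instance differs from the clean trace in one known place, a judge's answer can be checked mechanically instead of against subjective human preference. The toy judge catches the surface-level tool error but misses the claim that no longer matches its cited source, which is the kind of per-category gap a fine-grained breakdown is meant to expose.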
Subjects: Computation and Language (cs.CL)
Cite as: arXiv:2605.19196 [cs.CL]
(or arXiv:2605.19196v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2605.19196
arXiv-issued DOI via DataCite (pending registration)
Submission history
From: Leyao Wang
[v1] Mon, 18 May 2026 23:55:08 UTC (2,750 KB)