Time to REFLECT: Can We Trust LLM Judges for Evidence-based Research Agents?

Deep research agents increasingly automate complex information-seeking tasks, so reliable evaluation of them is critical. LLM-as-judge is used to assess these agents, but its reliability is poorly understood. The REFLECT benchmark introduces a fine-grained meta-evaluation and finds that current LLM judges are unreliable: even the best-performing models achieve overall accuracies below 55% across reasoning, tool-use, and report-quality failures, and they do especially poorly on evidence verification. The study also offers guidance for building more reliable evaluation pipelines.

Article intelligence

Engineers | Advanced

Key points

  • LLM judges show systematic limitations in evaluating deep research agents: even the best-performing models stay below 55% overall accuracy
  • REFLECT benchmark generates fine-grained failure instances via controlled interventions on agent traces
  • Current LLM judges perform particularly poorly on evidence verification tasks
  • The study provides actionable guidance for building more reliable evaluation pipelines
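
The controlled-intervention idea in the key points can be illustrated with a small Python sketch. It is not the REFLECT implementation: the `Step` and `Instance` types, the step kinds, the `unsupported_claim` label, and the ACME example are all hypothetical, chosen only to show how altering exactly one step of a clean trace yields an instance whose failure type and location are known by construction.

```python
from dataclasses import dataclass, replace
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Step:
    kind: str      # hypothetical step types: "reasoning", "tool_call", "report"
    content: str


@dataclass(frozen=True)
class Instance:
    trace: Tuple[Step, ...]
    failure_mode: str   # ground-truth label, known because we injected it
    step_index: int     # where the failure was injected


def inject_failure(trace: Sequence[Step], step_index: int,
                   failure_mode: str, corrupted_content: str) -> Instance:
    """Return a labeled copy of `trace` in which exactly one step is altered."""
    if not 0 <= step_index < len(trace):
        raise IndexError("step_index is outside the trace")
    steps = list(trace)
    # Localized edit: only the chosen step changes, everything else is kept.
    steps[step_index] = replace(steps[step_index], content=corrupted_content)
    return Instance(tuple(steps), failure_mode, step_index)


# Toy clean trace (invented for illustration).
clean_trace = [
    Step("reasoning", "Plan: find the 2023 revenue figure in the annual report."),
    Step("tool_call", "search(query='ACME 2023 annual report revenue')"),
    Step("report", "ACME reported revenue of $10M in 2023 [source 1]."),
]

instance = inject_failure(
    clean_trace,
    step_index=2,
    failure_mode="unsupported_claim",
    corrupted_content="ACME reported revenue of $25M in 2023 [source 1].",
)

# Indices at which the perturbed trace differs from the clean one.
changed = [i for i, (a, b) in enumerate(zip(clean_trace, instance.trace)) if a != b]
print(changed, instance.failure_mode)
```

Because the edit is localized, a judge's verdict on `instance.trace` can be checked mechanically against `instance.failure_mode` and `instance.step_index` rather than against subjective human preference.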

Why it matters

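Before LLM judges are deployed to supervise research agents, the judges themselves need to be evaluated. REFLECT finds that even the best-performing judges detect fine-grained failures with overall accuracy below 55%, so their verdicts on agent reports cannot be taken at face value.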

Technical impact

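The findings bear on how evaluation pipelines for deep research agents are built: which judge model to select, how to trade judge cost against reliability, and how much weight to give judge scores, particularly on evidence verification.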

[Submitted on 18 May 2026]

Authors: Leyao Wang and 7 other authors

Abstract: Deep research agents increasingly automate complex information-seeking tasks, producing evidence-grounded reports via multi-step reasoning, tool use, and synthesis. Their growing role demands scalable, reliable evaluation, positioning LLM-as-judge as a supervision paradigm for assessing factual accuracy, evidence use, and reasoning quality. Yet the reliability of these judges for deep research agents remains poorly understood, posing a critical meta-evaluation problem: before deploying LLM judges to supervise research agents, we must first evaluate the judges themselves. Existing meta-evaluations fall short in two ways: (1) reliance on coarse, subjective human-preference agreement; (2) focus on instruction-following or verifiable tasks, leaving open-ended agent executions unexplored. To address these gaps, we introduce REFLECT (REliable Fine-grained LLM judge Evaluation via Controlled inTervention), a meta-evaluation benchmark targeting fine-grained failure detection in agentic environments. REFLECT defines a detailed taxonomy of process- and outcome-level failure modes, instantiated by performing controlled and localized interventions on quality-screened agent execution traces. This yields verifiable, comprehensive, and fine-grained instances for validating the judge models. Our experiments show that current LLM judges remain unreliable: even the best-performing models achieve overall accuracies below 55% across reasoning, tool-use, and report-quality failures, with especially poor performance on evidence verification. Together, our taxonomy and findings expose systematic judge limitations, reveal tradeoffs in cost and reliability, and offer actionable guidance for building more reliable evaluation pipelines for deep research agents.
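
The headline metric, judge accuracy broken down by failure category, can be sketched in Python as follows. This is a minimal illustration under assumed conventions, not the paper's scoring code: the record layout `(category, gold_label, judge_label)`, the label names, and the four toy records are made up for the example and are not results from REFLECT.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Record = Tuple[str, str, str]  # (category, gold_label, judge_label), assumed layout


def accuracy_by_category(records: Iterable[Record]) -> Tuple[Dict[str, float], float]:
    """Return (per-category accuracy, overall accuracy) using exact label match."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for category, gold, predicted in records:
        totals[category] += 1
        hits[category] += int(gold == predicted)
    if not totals:
        raise ValueError("no records to score")
    per_category = {c: hits[c] / totals[c] for c in totals}
    overall = sum(hits.values()) / sum(totals.values())
    return per_category, overall


# Toy judge outputs (invented; "no_failure" means the judge missed the failure).
records = [
    ("reasoning", "logic_gap", "logic_gap"),
    ("reasoning", "logic_gap", "no_failure"),
    ("tool_use", "wrong_query", "wrong_query"),
    ("report_quality", "unsupported_claim", "no_failure"),
]

per_category, overall = accuracy_by_category(records)
print(per_category, overall)
```

Reporting the per-category numbers alongside the overall figure is what lets a weakness such as poor evidence verification show up instead of being averaged away.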

Subjects: Computation and Language (cs.CL)

Cite as: arXiv:2605.19196 [cs.CL]

(or arXiv:2605.19196v1 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2605.19196

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Leyao Wang [view email] [v1] Mon, 18 May 2026 23:55:08 UTC (2,750 KB)
