Faithful or Fabricated? A Causal Framework for Rationalization Bias in LLM Judges

Large language models (LLMs) are increasingly used as automatic judges for summarization and dialogue evaluation. Prior work has documented biases such as position, verbosity, and style preferences, but it largely focuses on outcomes rather than on the explanations judges give. This paper asks whether LLM judges are cue-invariant: whether their rankings and explanations stay stable when non-evidential cues are perturbed while the underlying texts are held fixed. It introduces a causal framework of interventions and metrics to test this. The authors report substantial cue-anchored rationalization under label and placebo perturbations, and find that the PROOF-BEFORE-PREFERENCE method markedly improves cue invariance over baselines.

Article intelligence

Key points

  • LLM judges exhibit cue-anchored rationalization: non-evidential cues shift their rankings and explanations even when the underlying texts are held fixed.
  • The paper develops cue interventions (Blind, Truth, Flip, Placebo, Reveal-After) and tie-aware metrics to quantify outcome anchoring and rationale anchoring.
  • On a new dataset of 1,000 summaries from traditional extractive models and LLMs, PROOF-BEFORE-PREFERENCE (evidence lock, score, rank) markedly improves cue invariance over baselines, including structured chain-of-thought prompting.
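The summary names the five interventions but does not define them, so the following is only a minimal sketch of how such cue conditions could be set up. The semantics of each mode, the `cue_labels` and `build_prompt` helpers, the placebo label format, and the prompt layout are all assumptions made for illustration, not the paper's implementation. The one property taken from the source is that the candidate texts stay fixed while only the non-evidential label changes.

```python
from typing import List, Optional, Tuple

# Mode names follow the paper; their exact semantics here are ASSUMED.
MODES = ("blind", "truth", "flip", "placebo", "reveal_after")

Labels = List[Optional[str]]


def cue_labels(sources: List[str], mode: str) -> Tuple[Labels, Labels]:
    """Return (labels shown with the texts, labels shown after judging).

    `sources` holds the true origin of each candidate, e.g. "extractive"
    or "llm". The texts themselves are never touched by this function.
    """
    n = len(sources)
    hidden: Labels = [None] * n
    if mode == "blind":          # no provenance cue at all
        return hidden, hidden
    if mode == "truth":          # correct provenance label
        return list(sources), hidden
    if mode == "flip":           # labels reversed; intended for pairs
        return list(reversed(sources)), hidden
    if mode == "placebo":        # meaningless labels (hypothetical format)
        return [f"System {chr(ord('A') + i)}" for i in range(n)], hidden
    if mode == "reveal_after":   # judged blind, labels disclosed afterwards
        return hidden, list(sources)
    raise ValueError(f"unknown mode: {mode!r}")


def build_prompt(texts: List[str], sources: List[str], mode: str) -> str:
    """Render the candidate block of a judge prompt for one condition."""
    shown, _ = cue_labels(sources, mode)
    lines = []
    for i, (text, label) in enumerate(zip(texts, shown), start=1):
        tag = f" [{label}]" if label else ""
        lines.append(f"Summary {i}{tag}: {text}")
    return "\n".join(lines)


if __name__ == "__main__":
    texts = ["The council approved the budget.", "Budget approved by council."]
    sources = ["extractive", "llm"]
    for mode in MODES:
        print(f"--- {mode} ---")
        print(build_prompt(texts, sources, mode))
```

Running the same texts through every mode and comparing the judge's rankings and explanations across conditions is the kind of comparison the cue-invariance question calls for.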

Why it matters

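If a judge's rankings and explanations shift when only a non-evidential cue changes, its stated reasons cannot be assumed to reflect the evidence in the texts being compared. That weakens any evaluation pipeline that relies on LLM judges for summarization or dialogue quality.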

Technical impact

May affect model selection, inference cost, product capability, and evaluation benchmarks.

[Submitted on 13 May 2026]

Title: Faithful or Fabricated? A Causal Framework for Rationalization Bias in LLM Judges

Authors: Riya Tapwal and 1 other author

Abstract: Large language models (LLMs) are increasingly used as automatic judges for summarization and dialogue evaluation. Prior work has documented biases such as position, verbosity, and style preferences, but largely focuses on outcomes, leaving judge explanations underexplored. We instead ask whether LLM judges are cue-invariant, i.e., whether their rankings and explanations remain stable when non-evidential cues are perturbed while holding the underlying texts fixed. We introduce a suite of cue interventions (Blind, Truth, Flip, Placebo, Reveal-After) and tie-aware metrics that quantify outcome anchoring and rationale anchoring, including label-aligned rhetoric and explanation drift, alongside consistency and stereotype-intrusion checks. We design anchoring attacks using verbosity and confidence cues, and compare two mitigations: structured chain-of-thought prompting and PROOF-BEFORE-PREFERENCE (evidence lock, score, rank). Using a new dataset of 1,000 summaries from traditional extractive models and LLMs, we find substantial cue-anchored rationalization under label and placebo perturbations, while PROOF-BEFORE-PREFERENCE markedly improves cue invariance over baselines.
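The abstract refers to tie-aware metrics without giving their definitions. As an illustration only, the sketch below uses Kendall's tau-b, a standard tie-aware rank correlation, to compare a judge's scores for the same summaries under two cue conditions. Using tau-b here is an assumption, not a claim about the paper's metric, and the function name and example scores are hypothetical.

```python
from itertools import combinations
from math import sqrt
from typing import Optional, Sequence


def kendall_tau_b(x: Sequence[float], y: Sequence[float]) -> Optional[float]:
    """Kendall's tau-b between two score lists over the same items.

    Pairs tied in both lists are ignored; pairs tied in only one list
    enter the denominator, which is what makes the statistic tie-aware.
    Returns None when either list is constant (tau-b is undefined).
    """
    if len(x) != len(y):
        raise ValueError("score lists must have the same length")
    concordant = discordant = ties_x = ties_y = 0
    for i, j in combinations(range(len(x)), 2):
        dx = x[i] - x[j]
        dy = y[i] - y[j]
        if dx == 0 and dy == 0:
            continue              # tied under both conditions
        if dx == 0:
            ties_x += 1           # tied only in x
        elif dy == 0:
            ties_y += 1           # tied only in y
        elif dx * dy > 0:
            concordant += 1       # same order under both conditions
        else:
            discordant += 1       # order reversed between conditions
    untied = concordant + discordant
    denom = sqrt((untied + ties_x) * (untied + ties_y))
    if denom == 0:
        return None
    return (concordant - discordant) / denom


if __name__ == "__main__":
    # Hypothetical judge scores for four summaries under two conditions.
    blind_scores = [4, 3, 3, 1]
    flip_scores = [3, 4, 3, 1]
    print(kendall_tau_b(blind_scores, flip_scores))  # ~0.4 (up to float rounding)
```

A value near 1 would indicate that the ranking is stable across the two conditions. Lower values mean the ranking moved even though the texts did not, which is the outcome-anchoring pattern the abstract describes.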

Subjects: Computation and Language (cs.CL)

Cite as: arXiv:2605.23970 [cs.CL]

(or arXiv:2605.23970v1 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2605.23970

arXiv-issued DOI via DataCite

Submission history

From: Abhishek Kumar Mr
[v1] Wed, 13 May 2026 07:00:16 UTC (6,274 KB)
