2026-05-26 04:00 UTC | Updated: 2026-06-30 13:03 UTC

Faithful or Fabricated? A Causal Framework for Rationalization Bias in LLM Judges

Large language models (LLMs) are increasingly used as automatic judges for summarization and dialogue evaluation. Prior work has documented biases such as position, verbosity, and style preferences, but largely focuses on outcomes rather than on the explanations judges give. This paper asks whether LLM judges are cue-invariant, introducing a causal framework of interventions and metrics that tests whether rankings and explanations stay stable when non-evidential cues are perturbed. The results show substantial cue-anchored rationalization, and the PROOF-BEFORE-PREFERENCE method markedly improves cue invariance over the baselines.

Source: arXiv Computational Linguistics
Authors: Riya Tapwal, Abhishek Kumar, Carsten Maple

[Submitted on 13 May 2026]

Abstract: Large language models (LLMs) are increasingly used as automatic judges for summarization and dialogue evaluation. Prior work has documented biases such as position, verbosity, and style preferences, but largely focuses on outcomes, leaving judge explanations underexplored. We instead ask whether LLM judges are cue-invariant, i.e., whether their rankings and explanations remain stable when non-evidential cues are perturbed while holding the underlying texts fixed. We introduce a suite of cue interventions (Blind, Truth, Flip, Placebo, Reveal-After) and tie-aware metrics that quantify outcome anchoring and rationale anchoring, including label-aligned rhetoric and explanation drift, alongside consistency and stereotype-intrusion checks. We design anchoring attacks using verbosity and confidence cues, and compare two mitigations: structured chain-of-thought prompting and PROOF-BEFORE-PREFERENCE (evidence lock, score, rank). Using a new dataset of 1,000 summaries from traditional extractive models and LLMs, we find substantial cue-anchored rationalization under label and placebo perturbations, while PROOF-BEFORE-PREFERENCE markedly improves cue invariance over baselines.
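The idea of cue invariance can be illustrated with a small sketch. The abstract does not define the interventions or metrics precisely, so everything below is an assumption made for illustration: the label strings, the `SWAP` mapping used for the flip condition, the stand-in `toy_judge` (a deliberately biased scoring function, not a real LLM call), and the `pairwise_agreement` measure, which is a simple tie-aware pairwise concordance rather than the paper's own metric. Reveal-After is omitted because it would need a two-stage judge prompt.

```python
from itertools import combinations

# Assumed label swap for the "flip" condition (two source types only).
SWAP = {"LLM": "extractive", "extractive": "LLM"}

def cue_labels(sources, condition):
    """Return the label shown to the judge for each summary; texts stay fixed."""
    if condition == "blind":
        return [None] * len(sources)          # no source label at all
    if condition == "truth":
        return list(sources)                  # true source label
    if condition == "flip":
        return [SWAP[s] for s in sources]     # deliberately wrong label
    if condition == "placebo":
        return [f"System-{i + 1}" for i in range(len(sources))]  # uninformative label
    raise ValueError(f"unknown condition: {condition}")

def toy_judge(text, label):
    """Hypothetical biased judge: word count plus a bonus for an 'LLM' label."""
    return len(text.split()) + (3 if label == "LLM" else 0)

def sign(x):
    return (x > 0) - (x < 0)

def pairwise_agreement(scores_a, scores_b):
    """Fraction of item pairs ordered the same way (ties count as their own outcome)."""
    pairs = list(combinations(range(len(scores_a)), 2))
    if not pairs:
        return 1.0
    same = sum(
        sign(scores_a[i] - scores_a[j]) == sign(scores_b[i] - scores_b[j])
        for i, j in pairs
    )
    return same / len(pairs)

def cue_invariance(texts, sources, judge):
    """Compare each cue condition's ranking against the blind ranking."""
    blind = [judge(t, l) for t, l in zip(texts, cue_labels(sources, "blind"))]
    result = {}
    for cond in ("truth", "flip", "placebo"):
        scores = [judge(t, l) for t, l in zip(texts, cue_labels(sources, cond))]
        result[cond] = pairwise_agreement(blind, scores)
    return result

texts = [
    "short summary",
    "a somewhat longer summary text",
    "mid length one",
]
sources = ["extractive", "LLM", "extractive"]

invariance = cue_invariance(texts, sources, toy_judge)
print(invariance)
```

A perfectly cue-invariant judge would score 1.0 under every condition. In this toy setting the label bonus changes the ordering under the flip condition, so its agreement with the blind ranking falls below 1.0, which is the kind of outcome anchoring the abstract describes.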

Subjects: Computation and Language (cs.CL)

Cite as: arXiv:2605.23970 [cs.CL] (or arXiv:2605.23970v1 [cs.CL] for this version)

DOI: https://doi.org/10.48550/arXiv.2605.23970 (arXiv-issued DOI via DataCite)

Submission history

From: Abhishek Kumar
[v1] Wed, 13 May 2026 07:00:16 UTC (6,274 KB)
