Faithful or Fabricated? A Causal Framework for Rationalization Bias in LLM Judges
Large language models (LLMs) are increasingly used as automatic judges for summarization and dialogue evaluation. Prior work has documented biases such as position, verbosity, and style preferences, but it largely focuses on outcomes and leaves judge explanations underexplored. This paper asks whether LLM judges are cue-invariant: do their rankings and explanations stay stable when non-evidential cues are perturbed while the underlying texts are held fixed? It introduces a causal framework of cue interventions and tie-aware metrics to test this. The results show substantial cue-anchored rationalization under label and placebo perturbations, and the PROOF-BEFORE-PREFERENCE method markedly improves cue invariance over baselines.
Article intelligence
Key points
- LLM judges exhibit cue-anchored rationalization bias: non-evidential cues shift their rankings and the explanations they give for them.
- The paper develops interventions (Blind, Truth, Flip, Placebo, Reveal-After) and tie-aware metrics to quantify outcome and rationale anchoring.
- On a new dataset of 1,000 summaries from traditional extractive models and LLMs, PROOF-BEFORE-PREFERENCE markedly improves cue invariance over baselines, including structured chain-of-thought prompting.
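The interventions and tie-aware metrics named above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the paper's implementation: the `Pair` type, the `render` prompt layout, the placebo labels "System X" and "System Y", and the half-weight given to a change into or out of a tie in `anchoring_rate`. Reveal-After is left out because the summary does not say how it is staged.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pair:
    """Two candidate summaries plus their true provenance labels (hypothetical type)."""
    text_a: str
    text_b: str
    source_a: str  # e.g. "LLM"
    source_b: str  # e.g. "extractive"


def render(pair: Pair, condition: str) -> str:
    """Build a judge prompt. Only the cue labels vary; the texts never change."""
    if condition == "blind":
        label_a, label_b = None, None
    elif condition == "truth":
        label_a, label_b = pair.source_a, pair.source_b
    elif condition == "flip":
        label_a, label_b = pair.source_b, pair.source_a
    elif condition == "placebo":
        # Assumed uninformative labels; the paper's placebo cues may differ.
        label_a, label_b = "System X", "System Y"
    else:
        raise ValueError(f"unknown condition: {condition}")

    def block(name: str, label, text: str) -> str:
        head = f"Summary {name}" + (f" (source: {label})" if label else "")
        return f"{head}:\n{text}"

    return block("A", label_a, pair.text_a) + "\n\n" + block("B", label_b, pair.text_b)


def anchoring_rate(base: list, perturbed: list) -> float:
    """Tie-aware rate of verdict change between two conditions.

    Verdicts are "A", "B", or "tie". A full reversal counts 1.0 and a move
    into or out of a tie counts 0.5 (an assumed weighting).
    """
    if len(base) != len(perturbed) or not base:
        raise ValueError("verdict lists must be non-empty and equally long")

    def shift(x: str, y: str) -> float:
        if x == y:
            return 0.0
        return 0.5 if "tie" in (x, y) else 1.0

    return sum(shift(x, y) for x, y in zip(base, perturbed)) / len(base)
```

A cue-invariant judge would give an `anchoring_rate` near zero when Blind verdicts are compared with Flip or Placebo verdicts, since the texts being judged are identical across conditions.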
Why it matters
If a judge's ranking or explanation shifts when only non-evidential cues change, its explanation cannot be read as a faithful account of the evidence in the texts it was asked to compare.
Technical impact
Most directly relevant to evaluation benchmarks and model selection workflows that rely on LLM judges for summarization or dialogue.
[Submitted on 13 May 2026]
Authors: Riya Tapwal and 1 other author
Abstract: Large language models (LLMs) are increasingly used as automatic judges for summarization and dialogue evaluation. Prior work has documented biases such as position, verbosity, and style preferences, but largely focuses on outcomes, leaving judge explanations underexplored. We instead ask whether LLM judges are cue-invariant, i.e., whether their rankings and explanations remain stable when non-evidential cues are perturbed while holding the underlying texts fixed. We introduce a suite of cue interventions (Blind, Truth, Flip, Placebo, Reveal-After) and tie-aware metrics that quantify outcome anchoring and rationale anchoring, including label-aligned rhetoric and explanation drift, alongside consistency and stereotype-intrusion checks. We design anchoring attacks using verbosity and confidence cues, and compare two mitigations: structured chain-of-thought prompting and PROOF-BEFORE-PREFERENCE (evidence lock, score, rank). Using a new dataset of 1,000 summaries from traditional extractive models and LLMs, we find substantial cue-anchored rationalization under label and placebo perturbations, while PROOF-BEFORE-PREFERENCE markedly improves cue invariance over baselines.
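The abstract names three stages for PROOF-BEFORE-PREFERENCE (evidence lock, score, rank) but does not spell them out. The sketch below is one plausible reading, offered as an assumption rather than the authors' method: evidence is extracted from the cue-free texts and frozen, scores are computed from the frozen evidence alone, and the ranking is derived from the scores with ties sharing a rank. The function names and the toy word-overlap extractor are hypothetical stand-ins for what would be LLM calls in practice.

```python
from typing import Callable, Dict, List, Tuple


def proof_before_preference(
    source: str,
    candidates: Dict[str, str],
    extract: Callable[[str, str], List[bool]],
    score: Callable[[Tuple[bool, ...]], float],
):
    """Three-stage judging sketch. Note that no cue labels are accepted at all."""
    # Stage 1 (evidence lock): extract evidence from the texts and freeze it
    # as immutable tuples so later stages cannot revise it.
    locked = {name: tuple(extract(source, text)) for name, text in candidates.items()}
    # Stage 2 (score): each score depends only on the locked evidence.
    scores = {name: score(evidence) for name, evidence in locked.items()}
    # Stage 3 (rank): dense ranking, so equal scores share the same rank.
    ordered = sorted(set(scores.values()), reverse=True)
    ranks = {name: ordered.index(value) + 1 for name, value in scores.items()}
    return locked, scores, ranks


def toy_extract(source: str, summary: str) -> List[bool]:
    """Toy evidence: for each summary word, is it present in the source?"""
    source_words = set(source.lower().split())
    return [word in source_words for word in summary.lower().split()]


def toy_score(evidence: Tuple[bool, ...]) -> float:
    """Toy score: fraction of summary words supported by the source."""
    return sum(evidence) / len(evidence) if evidence else 0.0


locked, scores, ranks = proof_before_preference(
    "the cat sat on the mat",
    {"A": "the cat sat", "B": "the dog sat", "C": "cat mat"},
    toy_extract,
    toy_score,
)
print(ranks)
```

Because the ranking is a pure function of the locked evidence, a cue such as a source label has no path into the verdict in this sketch. Whether the paper enforces the lock this way is not stated in the abstract.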
Subjects:
Computation and Language (cs.CL)
Cite as: arXiv:2605.23970 [cs.CL]
(or arXiv:2605.23970v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2605.23970
arXiv-issued DOI via DataCite
Submission history
From: Abhishek Kumar Mr
[v1] Wed, 13 May 2026 07:00:16 UTC (6,274 KB)