2026-06-16 | Updated: 2026-06-16

Metric Match: A Subset Selection Approach to Evaluating LLM Judge Reliability

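Metric Match is a method for estimating correlation-based reliability metrics of LLM judges from limited human annotations. It selects for human annotation a subset of samples whose reliability metric, computed against synthetic labels, matches that of the full population. Across four correlation metrics and 15 datasets, it achieves a win-rate of 0.838 against random subset selection, with an 18.7% lower average estimation error and 32.5% lower annotation needs. In a medical case study, it saves $1,041.67 over random selection for expert annotation. The method also extends to reliability classification, that is, deciding whether a given judge is above a deployment threshold. The code is publicly available, along with an installable package.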

Source: arXiv AI
Authors: Alyssa Unell, Natalie Dullerud, Naomi Boneh, Meena Jagadeesan, Tatsu Hashimoto, Nigam Shah, Sanmi Koyejo

[Submitted on 12 Jun 2026]

Abstract: LLM judges are used to reduce the need for costly human labor in evaluating open-ended text generation. However, the reliability of these judges depends critically on their alignment with human raters, a property that can itself only be measured with costly human annotations. In this work, we develop a method (Metric Match) for estimating correlation-based reliability metrics of LLM judges from limited annotations. Metric Match selects a subset of samples for human annotation such that the subset matches the population reliability metric with respect to acquired synthetic labels. We empirically show that Metric Match achieves a win-rate of 0.838 against random subset selection across four different correlation metrics and 15 datasets, with an 18.7% decrease in average estimation error and a 32.5% reduction in annotation needs. We provide a cost model and highlight a medical case study where our method saves $1,041.67 compared to random selection for expert annotation. Further, we shift the task from reliability estimation to reliability classification, that is, deciding whether a given judge is above a deployment threshold, and show that Metric Match outperforms random selection there as well. All project code is publicly available, and we additionally provide an installable package for ease of use.
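
The abstract does not spell out the selection procedure, so the following is only a minimal sketch of the general idea it describes: choose a subset whose judge-vs-synthetic-label correlation is close to the value on the full population. Everything here is an assumption made for illustration and is not taken from the paper or its package: the function names (pearson, select_matching_subset), the use of Pearson correlation, the naive random search over candidate subsets, and the toy data.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # degenerate input: report zero correlation
    return cov / math.sqrt(vx * vy)


def select_matching_subset(judge, synthetic, k, n_candidates=500, seed=0):
    """Hypothetical helper: among n_candidates random size-k subsets, return
    the one whose judge-vs-synthetic correlation is closest to the
    full-population value. A naive random search, not the paper's algorithm."""
    rng = random.Random(seed)
    target = pearson(judge, synthetic)  # population-level metric
    best_idx, best_gap = None, float("inf")
    for _ in range(n_candidates):
        cand = rng.sample(range(len(judge)), k)
        sub_metric = pearson([judge[i] for i in cand],
                             [synthetic[i] for i in cand])
        gap = abs(sub_metric - target)
        if gap < best_gap:
            best_idx, best_gap = sorted(cand), gap
    return best_idx, target, best_gap


# Toy data (assumed): judge scores plus noisy synthetic labels for 200 samples.
data_rng = random.Random(1)
judge = [data_rng.gauss(0, 1) for _ in range(200)]
synthetic = [j + data_rng.gauss(0, 1) for j in judge]

# The 20 selected indices are the samples one would send to human annotators.
subset, target, gap = select_matching_subset(judge, synthetic, k=20)
print(f"population r = {target:.3f}, subset gap = {gap:.4f}")
```

In this sketch, human labels would then be collected only for the indices in subset, and the judge-vs-human correlation on that subset would serve as the estimate of the judge's reliability. Swapping pearson for a rank correlation would follow the same pattern.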

Subjects: Artificial Intelligence (cs.AI)

Cite as: arXiv:2606.15029 [cs.AI]

(or arXiv:2606.15029v1 [cs.AI] for this version)

https://doi.org/10.48550/arXiv.2606.15029

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Alyssa Unell
[v1] Fri, 12 Jun 2026 23:54:16 UTC (10,062 KB)
