2026-06-05 04:00 UTC | Original source | 2 min read | Updated: 2026-06-30 13:03 UTC

ERRORQUAKE: Heavy-Tailed Error Severity Distributions in Open-Weight Large Language Models

At matched accuracy, open-weight LLMs differ substantially in the shape of their error severity distribution — a difference invisible to the scalar error rate. The Errorquake-10k benchmark scores each response on a continuous 0-4 severity scale across 8 domains and 5 difficulty tiers, revealing that severity profiles provide information beyond error rate.

Source: arXiv Machine Learning | Author: Jason Z Wang

Article intelligence

Engineers | Advanced

Key points

The Errorquake-10k benchmark scores LLM responses on a continuous 0-4 severity scale, revealing heavy-tailed severity distributions.
Many model pairs show significantly different severity distributions at matched accuracy, indicating that error rate alone is insufficient.
Severity distribution is proven to be informationally non-redundant with error rate, providing discriminative information.
Error types shift with severity: low-severity errors are mostly retrieval errors, high-severity errors are fabrications.
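To make the matched-accuracy point concrete, here is a minimal sketch with two hypothetical models that share the same error rate but differ in how much of their error mass sits in the high-severity tail. The scores are invented for illustration and are not taken from the paper. Treating a score of 0 as "no error" and using 3.0 as the high-severity cutoff are both assumptions of this sketch.

```python
# Hypothetical severity scores on the 0-4 scale (assumption: 0 means no error).
# These numbers are made up for illustration; they are not benchmark results.
model_a = [0.0] * 80 + [0.5] * 12 + [1.0] * 6 + [2.0] * 2
model_b = [0.0] * 80 + [0.5] * 4 + [1.0] * 4 + [3.0] * 6 + [4.0] * 6


def error_rate(scores):
    """Fraction of responses with any error (severity above 0)."""
    return sum(s > 0 for s in scores) / len(scores)


def high_severity_share(scores, threshold=3.0):
    """Among erroneous responses, the fraction at or above the threshold."""
    errors = [s for s in scores if s > 0]
    return sum(s >= threshold for s in errors) / len(errors)


for name, scores in (("A", model_a), ("B", model_b)):
    print(name, error_rate(scores), high_severity_share(scores))
```

Both lists contain 20 errors out of 100 responses, so a scalar error rate cannot tell them apart, while the share of high-severity errors does.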

Why it matters

Two models with the same error rate can differ substantially in how severe their errors are, so a scalar error rate alone can hide differences that matter when comparing models.

Technical impact

May affect model selection, inference cost, product capability, and evaluation benchmarks.

This panel is AI-generated and reviewed for accuracy.

[2606.05170] ERRORQUAKE: Heavy-Tailed Error Severity Distributions in Open-Weight Large Language Models

[Submitted on 15 Apr 2026]

Abstract: At matched accuracy, open-weight LLMs differ substantially in the shape of their error severity distribution -- a difference invisible to the scalar error rate. Hallucination benchmarks report a single error count and treat all errors as equivalent, yet a wrong date and a fabricated court ruling differ by orders of magnitude. We introduce Errorquake-10k, a 10,000-query benchmark scoring each response on a continuous 0-4 severity scale across 8 domains and 5 difficulty tiers, and we fit per-model severity distributions for 21 open-weight models. For each model we estimate a severity distribution index (b, the Gutenberg-Richter upper-tail slope) with 95% bootstrap confidence intervals. Headline: across the 210 model pairs, 85 have disjoint 95% b confidence intervals at matched accuracy (|Delta epsilon|
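As an illustration of what estimating an upper-tail slope b with a bootstrap confidence interval can look like, the sketch below uses the Aki-style maximum-likelihood estimator from seismology, b = log10(e) / (mean(s) - s_min), with a percentile bootstrap. The abstract does not say which estimator, tail threshold, or bootstrap variant the paper uses, so all of these are assumptions, as are the function names and the synthetic data. The sketch also ignores the cap at severity 4 and does not perform the accuracy matching described in the abstract.

```python
import math
import random


def b_value(severities, s_min):
    """Aki-style ML estimate of the tail slope above s_min (assumed estimator)."""
    tail = [s for s in severities if s >= s_min]
    if not tail:
        raise ValueError("no observations at or above s_min")
    mean_excess = sum(tail) / len(tail) - s_min
    if mean_excess <= 0:
        raise ValueError("tail has no spread above s_min")
    return math.log10(math.e) / mean_excess


def bootstrap_ci(severities, s_min, n_boot=500, alpha=0.05, seed=0):
    """Percentile bootstrap interval for b, resampling the tail with replacement."""
    rng = random.Random(seed)
    tail = [s for s in severities if s >= s_min]
    estimates = sorted(
        b_value(rng.choices(tail, k=len(tail)), s_min) for _ in range(n_boot)
    )
    lo = estimates[int(alpha / 2 * n_boot)]
    hi = estimates[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


def disjoint(ci_a, ci_b):
    """True when two (lo, hi) intervals do not overlap."""
    return ci_a[1] < ci_b[0] or ci_b[1] < ci_a[0]


def synthetic_severities(b, n, s_min, seed):
    """Synthetic tail with N(>= s) proportional to 10**(-b * s); not paper data."""
    rng = random.Random(seed)
    return [s_min + rng.expovariate(b * math.log(10)) for _ in range(n)]


# Two hypothetical models with different true slopes (values chosen arbitrarily).
model_a = synthetic_severities(b=0.7, n=2000, s_min=1.0, seed=1)
model_b = synthetic_severities(b=1.3, n=2000, s_min=1.0, seed=2)
ci_a = bootstrap_ci(model_a, s_min=1.0)
ci_b = bootstrap_ci(model_b, s_min=1.0)
print("model A:", b_value(model_a, 1.0), ci_a)
print("model B:", b_value(model_b, 1.0), ci_b)
print("disjoint intervals:", disjoint(ci_a, ci_b))

# 21 models give C(21, 2) unordered pairs, matching the 210 pairs in the abstract.
print(math.comb(21, 2))  # 210
```

A smaller b means a heavier upper tail, that is, relatively more of a model's errors fall at high severity. Comparing intervals for overlap mirrors the abstract's "disjoint 95% b confidence intervals" criterion.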
