2026-06-30 04:00 UTC | Updated: 2026-06-30 07:51 UTC

Data and Evaluation Closed-Loop for Model Capability Enhancement

The 'capability slice', a unit for grouping evaluation samples, bridges the gap between evaluation and data in LLM pre-training, enabling targeted data interventions from benchmark failures. In two case studies, the closed loop built around it correctly traced one benchmark drop to a masked <EOS> loss artifact and one math-reasoning weakness to a genuine data issue.

Source: arXiv AI | Authors: Zhixuan Li, Jiangan Yuan, Han Xu

[Submitted on 26 Jun 2026]

Abstract: Model capability is the central variable in LLM pre-training, yet it is never observed directly: data shapes it prospectively, while evaluation reveals it only retrospectively, compressing samples, prompts, decoding, and scoring rules into one noisy score. Practical optimization runs this backward: a failure is observed first, and the engineer must infer the corpus fix. The two sides speak incompatible vocabularies (benchmark names and per-sample correctness versus data sources, domains, and quality labels), so this inference is usually intuition, not method. We close this gap with the capability slice: a group of evaluation samples sharing background condition, task type, solving operation, and output constraint. It is precise enough to localize a single weakness yet stable enough to survive aggregation, unlike a benchmark name, which is too coarse, or a single sample, which is too noisy. Built around this unit, an evaluation taxonomy, a non-instruction data taxonomy, and mapping rules form a closed loop that turns a benchmark-level failure into a targeted, testable data intervention. We test this loop on two case studies pulling in opposite directions. First, the loop rules the data out: continued pre-training drives BBH down by 46.82%, but diagnosis traces this to a single masked <EOS> loss rather than weakened reasoning; restoring it recovers BBH to 66.44, above the original checkpoint, without changing the data. Second, the loop rules the data in: a persistent math-reasoning weakness is decomposed by solving operation into specific failing combinations, and a weakness-targeted sampling procedure built from it lifts AIME2025/AIME2026 Pass@128 from 6.67/0.00 to 26.67 each. The same unmodified loop reaches opposite, correct verdicts in both cases, showing the evaluation-to-data inference can be routine, auditable, and experimentally validated rather than intuitive.
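
To make the slice idea concrete, here is a minimal sketch of grouping per-sample evaluation results by the four attributes the abstract names and then turning per-slice accuracy into sampling weights. It is only an illustration under stated assumptions: the field names, the `SliceKey` type, the error-rate weighting, and the `min_samples` cutoff are all hypothetical and are not taken from the paper, whose actual taxonomy, mapping rules, and sampling procedure are not described in the abstract. The toy data is made up.

```python
from collections import defaultdict
from typing import NamedTuple


class SliceKey(NamedTuple):
    """Hypothetical slice identity: the four attributes named in the abstract."""
    background: str
    task_type: str
    operation: str
    output_constraint: str


def slice_accuracy(samples):
    """Group per-sample results by slice and return {key: (accuracy, n)}."""
    totals = defaultdict(lambda: [0, 0])  # key -> [num_correct, num_samples]
    for s in samples:
        key = SliceKey(s["background"], s["task_type"],
                       s["operation"], s["output_constraint"])
        totals[key][0] += int(s["correct"])
        totals[key][1] += 1
    return {k: (c / n, n) for k, (c, n) in totals.items()}


def sampling_weights(acc, min_samples=2):
    """Assumed heuristic: weight each slice by its error rate, normalized to 1.

    Slices with fewer than `min_samples` results are skipped as too noisy.
    """
    raw = {k: 1.0 - a for k, (a, n) in acc.items() if n >= min_samples}
    total = sum(raw.values())
    return {k: w / total for k, w in raw.items()} if total > 0 else {}


def make(operation, correct, background="algebra"):
    """Build one made-up evaluation record for the toy example."""
    return {"background": background, "task_type": "word_problem",
            "operation": operation, "output_constraint": "integer",
            "correct": correct}


# Toy data: two slices with four samples each, one slice with a single sample.
samples = (
    [make("case_analysis", c) for c in (True, False, False, False)]
    + [make("substitution", c) for c in (True, True, True, False)]
    + [make("case_analysis", False, background="geometry")]
)

acc = slice_accuracy(samples)
weights = sampling_weights(acc)
for key, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{key.operation:15s} accuracy={acc[key][0]:.2f} weight={w:.2f}")
```

In this toy run the `case_analysis` slice, which fails more often, receives the larger share of the sampling budget, while the single-sample geometry slice is ignored. That mirrors the abstract's point that a slice should be fine enough to localize a weakness but large enough to be stable under aggregation.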

Subjects: Artificial Intelligence (cs.AI)

Cite as: arXiv:2606.28471 [cs.AI]

(or arXiv:2606.28471v1 [cs.AI] for this version)

https://doi.org/10.48550/arXiv.2606.28471

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Zhixuan Li
[v1] Fri, 26 Jun 2026 14:45:57 UTC (1,600 KB)
