2026-06-15 · 2 min read · Updated: 2026-06-15

Can Post-Training Turn LLMs into Good Medical Coders? An Empirical Study of Generative ICD Coding

A new study explores how post-training methods like supervised fine-tuning and reinforcement learning can significantly improve generative LLMs for ICD coding, challenging the notion that LLMs are weak medical coders when evaluated solely via prompting.

Source: arXiv Computational Linguistics
Authors: Ziqing Wang, Weihao Li, Shijie Chen, Yuan Luo, Kaize Ding


[Submitted on 11 Jun 2026]


Abstract: Automated International Classification of Diseases (ICD) coding is a core medical-coding task for billing, epidemiology, and clinical decision support. Generative large language models (LLMs) are often reported as weak medical coders, but this finding mainly comes from inference-time settings such as prompting, retrieval, reranking, or tool use, leaving the role of task-specific post-training underexplored. We present a controlled empirical study of post-training for generative ICD coding, comparing discriminative baselines with LLM coders across prompting, supervised fine-tuning (SFT), and reinforcement learning (RL) under a common protocol and metric set. To our knowledge, this is the first study to evaluate RL-based post-training for generative LLM coders in ICD coding. We further introduce PHI, a diagnostic curriculum that extends Group Relative Policy Optimization (GRPO) to refine missed-code cases. Our results show that prompting-only evaluation substantially underestimates the potential of LLMs for ICD coding. SFT provides the main capability jump, GRPO further improves code-set prediction beyond SFT, and PHI provides targeted gains on macro-level performance. These findings suggest that the main bottleneck is not the generative formulation alone, but how the model is adapted and optimized for full-taxonomy recall. We release our code, data splits, and checkpoints at this https URL.
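The abstract distinguishes code-set prediction from macro-level performance without naming the metrics. As a rough illustration only, the sketch below assumes the common multi-label convention of micro- and macro-averaged F1 over predicted ICD code sets; the function name `micro_macro_f1` and the toy codes are hypothetical and are not taken from the paper. It shows why a model that always predicts the frequent code can look reasonable on micro-F1 while scoring poorly on macro-F1, which weights every code equally.

```python
from collections import Counter

def micro_macro_f1(gold, pred):
    """Micro- and macro-averaged F1 for multi-label code prediction.

    gold, pred: lists of sets of code strings, one set per document.
    This is a generic metric sketch, not the paper's evaluation script.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        for c in g & p:
            tp[c] += 1          # code predicted and correct
        for c in p - g:
            fp[c] += 1          # code predicted but not in gold
        for c in g - p:
            fn[c] += 1          # gold code the model missed

    def f1(t, f_pos, f_neg):
        denom = 2 * t + f_pos + f_neg
        return 2 * t / denom if denom else 0.0

    codes = set(tp) | set(fp) | set(fn)
    # Micro: pool counts over all codes, so frequent codes dominate.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: average per-code F1, so rare codes count as much as common ones.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in codes) / len(codes) if codes else 0.0
    return micro, macro

# Toy example (made-up data): the model only ever predicts the frequent code.
gold = [{"I10", "E11.9"}, {"I10"}, {"I10", "N18.3"}]
pred = [{"I10"}, {"I10"}, {"I10"}]
micro, macro = micro_macro_f1(gold, pred)
print(f"micro-F1={micro:.2f} macro-F1={macro:.2f}")  # micro-F1=0.75 macro-F1=0.33
```

Under this assumed metric, every missed rare code contributes a per-code F1 of zero to the macro average, which is consistent with the abstract's framing of missed-code cases and full-taxonomy recall as the bottleneck.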

Subjects: Computation and Language (cs.CL)

Cite as: arXiv:2606.13940 [cs.CL]

(or arXiv:2606.13940v1 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2606.13940

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Ziqing Wang
[v1] Thu, 11 Jun 2026 22:04:50 UTC (625 KB)
