2026-06-08 04:00 UTC (updated 2026-06-30 13:03 UTC)

Data-Efficient Autoregressive-to-Diffusion Language Models via On-Policy Distillation

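This paper introduces the On-Policy Diffusion Language Model (OPDLM), which transforms autoregressive language models (ARLMs) into diffusion language models (DLMs) via on-policy distillation. The method targets two distribution shifts: the change of training objective, which can discard knowledge the ARLM has acquired, and the mismatch between training on randomly masked sequences and the trajectories seen at inference. The authors report that OPDLM reaches strong performance with 15x to 7,000x fewer training tokens across a wide variety of tasks, positioning DLM transformation as a form of ARLM post-training.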

Source: arXiv Computational Linguistics
Authors: Xingyu Su, Jacob Helwig, Shubham Parashar, Atharv Chagi, Lakshmi Jotsna, Degui Zhi, James Caverlee, Dileep Kalathil, Shuiwang Ji

[Submitted on 4 Jun 2026]

Abstract: We study the transformation of autoregressive models (ARLMs) into diffusion language models (DLMs). Rather than pretraining from scratch, prior work replaces the causal attention in ARLMs with bidirectional attention and then trains the resulting model using a DLM objective. However, these approaches incur two distribution shifts. First, transitioning from a next-token prediction objective to a DLM objective can discard knowledge acquired by the ARLM during training. Second, standard DLMs suffer from a train-inference mismatch, as the training loss is defined on randomly masked sequences rather than the trajectories encountered at inference produced by confidence-based decoding. To address both challenges, we introduce an On-Policy Diffusion Language Model (OPDLM) in which On-Policy Distillation (OPD) is employed for ARLM-to-DLM transformation. Specifically, OPDLM is trained via self-OPD, where the student, an ARLM with bidirectional attention, generates its own trajectories, and the teacher, the original frozen ARLM, distills its knowledge by providing target logits on these trajectories. By training directly in an on-policy manner, OPDLM eliminates the train-inference mismatch in DLMs, while distillation from the original model enhances knowledge retention from the ARLM. Empirical results demonstrate that OPDLM requires 15x to 7,000x fewer training tokens with strong performance across a wide variety of tasks. OPDLM avoids the prohibitive cost of DLM pretraining and positions DLM transformation as a form of ARLM post-training.
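The self-OPD loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not give the loss, the KL direction, or how the causal teacher is queried, so everything below is an assumption made for illustration. The names `MASK`, `confidence_decode`, `distill_loss`, `student_table`, and `teacher_table` are hypothetical, the "models" are fixed logit tables that ignore context, and the loss is assumed to be KL(teacher || student) averaged over the positions still masked in each state the student visits.

```python
import numpy as np

MASK = -1  # hypothetical mask token id (assumption, not from the paper)

def log_softmax(x):
    # Numerically stable log-softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

def confidence_decode(student_logits_fn, length, tokens_per_step=1):
    """Roll out the student with confidence-based unmasking.

    Returns the partially masked states visited (the on-policy trajectory)
    and the final, fully unmasked sequence.
    """
    seq = np.full(length, MASK)
    states = []
    while (seq == MASK).any():
        states.append(seq.copy())
        probs = np.exp(log_softmax(student_logits_fn(seq)))  # (length, vocab)
        conf = probs.max(axis=-1)
        conf[seq != MASK] = -np.inf  # only masked positions compete
        k = min(tokens_per_step, int((seq == MASK).sum()))
        pick = np.argsort(-conf)[:k]  # most confident masked positions
        seq[pick] = probs[pick].argmax(axis=-1)
    return states, seq

def distill_loss(student_logits, teacher_logits, masked):
    """Mean KL(teacher || student) over masked positions (assumed loss form)."""
    log_p = log_softmax(teacher_logits[masked])
    log_q = log_softmax(student_logits[masked])
    return float((np.exp(log_p) * (log_p - log_q)).sum(axis=-1).mean())

# Toy stand-ins for the two models: fixed logit tables that ignore context.
rng = np.random.default_rng(0)
length, vocab = 6, 5
student_table = rng.normal(size=(length, vocab))
teacher_table = rng.normal(size=(length, vocab))
student_logits_fn = lambda seq: student_table
teacher_logits_fn = lambda seq: teacher_table  # frozen teacher (toy)

# 1) The student generates its own trajectory.
states, final_seq = confidence_decode(student_logits_fn, length, tokens_per_step=2)
# 2) The teacher provides target logits for that trajectory.
teacher_logits = teacher_logits_fn(final_seq)
# 3) The student is scored on the states it actually visited.
losses = [distill_loss(student_logits_fn(s), teacher_logits, s == MASK) for s in states]
loss = float(np.mean(losses))
```

In a real setup the student logits would come from a bidirectional-attention network and `loss` would be backpropagated into it while the teacher stays frozen. The point of the sketch is only the data flow: training states come from the student's own confidence-based decoding rather than from random masking.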

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Cite as: arXiv:2606.06712 [cs.CL]

(or arXiv:2606.06712v1 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2606.06712

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Xingyu Su
[v1] Thu, 4 Jun 2026 20:58:08 UTC (1,688 KB)
