Representation-Conditioned Diffusion Models for Guided Training Data Generation
This work evaluates representation-conditioned latent diffusion models that use learned representations from DINOv2, DINOv3, and CLIP to generate synthetic image data for classifier training. On ImageNet100, this approach outperforms class-conditioned generation by +10.76 p.p. top-1 accuracy. With a scaled-up synthetic dataset, the resulting classifier outperforms one trained on the real data by +2.0 p.p. top-1 accuracy. Generated images also outperform classical augmentation methods when used for augmentation, and the conditioning space can be used for sample filtering, which makes the approach a promising way to augment, complement, or potentially replace real datasets in large-scale visual learning.
Article intelligence
Key points
- Representation-conditioned diffusion models outperform class-conditioned ones by 10.76 p.p. on ImageNet100.
- Scaled synthetic datasets can beat real-data-trained classifiers by 2.0 p.p. top-1 accuracy.
- Generated images used for augmentation outperform classical augmentation methods, and the conditioning space can be used to filter samples and further improve their training value.
- The approach has potential to supplement or replace real-world datasets, alleviating data scarcity.
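The first key point contrasts two conditioning signals. The toy sketch below shows the difference in how those signals are assembled, not the paper's implementation: `toy_encoder`, `class_embeddings`, and the list-based "images" are hypothetical stand-ins for a frozen encoder such as DINOv2 or CLIP, a learned class-embedding table, and real image tensors. Class conditioning offers one conditioning vector per class, while representation conditioning offers one per image.

```python
def class_conditioning(labels, class_embeddings):
    """One conditioning vector per class label (hypothetical embedding table)."""
    return [class_embeddings[y] for y in labels]


def representation_conditioning(images, encoder):
    """One conditioning vector per image, produced by a frozen encoder."""
    return [encoder(image) for image in images]


def toy_encoder(image):
    # Hypothetical stand-in for a pretrained encoder: it summarizes an
    # "image" (a flat list of pixel values) by its mean and value range.
    mean = sum(image) / len(image)
    return (round(mean, 3), max(image) - min(image))


# Four toy images belonging to two classes.
images = [[0.1, 0.2, 0.3], [0.8, 0.9, 1.0], [0.0, 0.5, 1.0], [0.4, 0.4, 0.4]]
labels = [0, 0, 1, 1]
class_embeddings = {0: (1.0, 0.0), 1: (0.0, 1.0)}

class_cond = class_conditioning(labels, class_embeddings)
repr_cond = representation_conditioning(images, toy_encoder)

# Count the distinct conditioning signals available to the generator.
n_class_conditions = len(set(class_cond))
n_repr_conditions = len(set(repr_cond))
print(n_class_conditions, n_repr_conditions)
```

In this toy setup the class-conditioned generator sees two distinct conditions and the representation-conditioned one sees four. That finer-grained signal is one plausible intuition for the reported gains in sample quality and mode coverage, though the abstract does not spell out the mechanism.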
Why it matters
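Large-scale datasets are expensive to collect, curate, and annotate. If classifiers trained on synthetic images can match or exceed real-data training (here, +2.0 p.p. top-1 accuracy with a scaled synthetic dataset), that bottleneck becomes easier to work around.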
Technical impact
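May affect how training datasets for image classifiers are built: synthetic images could augment, complement, or potentially replace real data, and the conditioning space offers a handle for filtering generated samples.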
[2605.27495] Representation-Conditioned Diffusion Models for Guided Training Data Generation
[Submitted on 26 May 2026]
Title: Representation-Conditioned Diffusion Models for Guided Training Data Generation
Authors: Nithesh Chandher Karthikeyan and 2 other authors
Abstract: Data availability remains a critical bottleneck in many deep learning applications. Large-scale datasets are often expensive to collect, curate, and annotate, which can limit the scalability and applicability of supervised learning methods. In this work, we evaluate the classification performance of models trained on synthetic image datasets produced by generative deep learning. In particular, we use latent diffusion models conditioned on learned representations from DINOv2, DINOv3, and CLIP. Our results demonstrate that this representation-conditioned formulation outperforms class-conditioned generation by a large margin (+10.76 p.p. top-1 accuracy on ImageNet100) by improving sample quality and mode coverage. Furthermore, by scaling the size of the synthetic dataset, we are able to outperform a classifier trained on the real data (+2.0 p.p. top-1 accuracy).
We also demonstrate how generated images can be used for augmentation purposes, outperforming classical augmentation methods, and how the conditioning space can be used for sample filtering to further improve training value. Collectively, these findings highlight that representation-conditioned diffusion models provide a promising approach for augmenting, complementing, or potentially replacing real-world datasets in large-scale visual learning tasks.
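The abstract does not state which filtering criterion is used in the conditioning space. As one possible reading, the sketch below keeps a generated sample only if its re-encoded representation stays close, by cosine similarity, to the representation it was conditioned on. The function names, the threshold of 0.8, and the two-dimensional toy vectors are all assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def filter_samples(conditioning, generated, threshold=0.8):
    """Return indices of samples whose representation matches its condition.

    conditioning[i] is the representation used to generate sample i;
    generated[i] is the representation of sample i after re-encoding it
    with the same (assumed) encoder. The threshold is a free parameter.
    """
    kept = []
    for i, (cond, gen) in enumerate(zip(conditioning, generated)):
        if cosine_similarity(cond, gen) >= threshold:
            kept.append(i)
    return kept


# Toy 2-D representations: sample 1 drifted far from its condition.
conditioning = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
generated = [[0.9, 0.1], [1.0, 0.0], [0.7, 0.8]]

kept = filter_samples(conditioning, generated, threshold=0.8)
print(kept)  # [0, 2]
```

In practice the threshold would need tuning against downstream classifier accuracy, since a strict cutoff trades dataset size and diversity for fidelity to the conditioning signal.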
Subjects:
Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Cite as: arXiv:2605.27495 [cs.CV]
(or arXiv:2605.27495v1 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2605.27495
arXiv-issued DOI via DataCite (pending registration)
Submission history
From: Nithesh Chandher Karthikeyan Mr
[v1] Tue, 26 May 2026 17:32:50 UTC (299 KB)