Designing synthetic datasets for the real world: Mechanism design and reasoning from first principles
Google introduces Simula, a reasoning-first framework that treats synthetic data generation as mechanism design, enabling fine-grained control over coverage, complexity, and quality for specialized AI domains.
April 16, 2026
Tim R. Davidson, Student Researcher, and Hamza Harkous, Senior Staff Research Scientist, Google
To address the scarcity of data required for specialized AI, we introduce Simula, a framework that reframes synthetic data generation as dataset-level mechanism design. By using reasoning to architect datasets from first principles, Simula enables fine-grained control over coverage, complexity, and quality, providing scalable generation for privacy-sensitive or data-scarce domains.
The rapid advance of generalist AI models has been fueled by the abundance of internet data. However, widespread integration of AI will require models to specialize in novel, uncommon, and privacy-sensitive applications where data is inherently scarce or inaccessible.
Relying on real-world data alone to bridge this gap imposes significant limitations:
- Cost and accessibility: Creating specialized datasets manually is prohibitively expensive, time-consuming, and error-prone.
- Operational drag: The static nature of real-world data slows development cycles. In contrast, a synthetic-first approach enables "programmable workflows" where data is treated like code: versioned, reproducible, and inspectable.
- Preparedness: We cannot afford a reactive approach to topics like safety, where models can be hardened only after failures occur. Synthetic data allows us to proactively generate edge cases and stress-test systems against scenarios that have not yet happened in the wild.
While synthetic data is a promising alternative, current generation methods often lack the rigor required for production-scale deployment. Many existing approaches rely on manual prompts, evolutionary algorithms, or extensive seed data from the target distribution.
These methods limit scalability (due to reliance on seeds or human effort), explainability (due to black-box evolutionary steps), and control (due to entangled generation parameters). Most critically, they typically operate at the sample level — optimizing one data point at a time — rather than designing the dataset as a whole.
To solve this, we need to reframe synthetic data generation as a problem of mechanism design. Production use cases require a focus beyond just "more data"; they require fine-grained resource allocation where coverage, complexity, and quality are independently controllable variables.
Simula: A reasoning-first framework
In our paper, “Reasoning-Driven Synthetic Data Generation and Evaluation”, published in Transactions on Machine Learning Research, we introduce Simula. Unlike methods that rely on opaque processes, Simula employs a "reasoning-first" methodology, constructing entire datasets from first principles. This approach is seedless and agentic, allowing the generation capabilities to improve naturally as the reasoning capabilities of the underlying models advance.
Controlling the axes of data generation
Simula decomposes the generation process into distinct, controllable axes, using four steps:
Global Diversification: Instead of random sampling, Simula uses reasoning models to map the conceptual space of a target domain into deep, hierarchical taxonomies. This acts as a "sampling scaffold". By defining sampling strategies over these taxonomies, we can control global diversity — ensuring the dataset covers the long tail of a domain rather than clustering around common modes.
To map the conceptual space of a target domain without relying on human seed data, Simula employs a reasoning-driven, recursive expansion process. At each depth level, the system generates multiple candidate sub-categories (proposals) that are subsequently evaluated, merged, and filtered by a critic model. This iterative "propose-and-refine" loop dynamically builds a dense, hierarchical taxonomy — such as the Cyber Threat Intelligence tree — that serves as the foundational scaffold to ensure global dataset diversity.
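The propose-and-refine loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Simula's implementation: `propose` and `critic` are hypothetical stand-ins for model calls (replaced here by a toy lookup table and a simple de-duplicating filter), and the names `Node`, `expand`, and `leaves` are invented for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    children: list["Node"] = field(default_factory=list)


# Hypothetical stand-in for several independent proposal calls to a reasoning model.
TOY_PROPOSALS = {
    "Cyber Threat Intelligence": [["Malware", "Phishing"], ["malware", "Web Attacks"]],
    "Web Attacks": [["SQL injection", "XSS"], ["SQL injection"]],
}


def propose(concept: str) -> list[list[str]]:
    """Return multiple candidate sub-category lists for a concept (toy data)."""
    return TOY_PROPOSALS.get(concept, [])


def critic(concept: str, proposals: list[list[str]]) -> list[str]:
    """Toy critic: merge all proposals and drop case-insensitive duplicates."""
    seen: set[str] = set()
    merged: list[str] = []
    for candidates in proposals:
        for name in candidates:
            if name.lower() not in seen:
                seen.add(name.lower())
                merged.append(name)
    return merged


def expand(concept: str, max_depth: int) -> Node:
    """Recursively grow the taxonomy: propose children, refine them, then recurse."""
    node = Node(concept)
    if max_depth == 0:
        return node
    for child in critic(concept, propose(concept)):
        node.children.append(expand(child, max_depth - 1))
    return node


def leaves(node: Node) -> list[str]:
    """Collect leaf names, which could later serve as sampling targets."""
    if not node.children:
        return [node.name]
    return [leaf for child in node.children for leaf in leaves(child)]


tree = expand("Cyber Threat Intelligence", max_depth=2)
print(leaves(tree))
```

In a real system the critic would itself be a model judging relevance and overlap; the de-duplication here only shows where that refinement step sits in the recursion.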
Equipped with a set of deep taxonomies, we can now start mapping out our coverage space of interest and optimize (2) local diversity, (3) complexity, and (4) quality:
- Local Diversification: To ensure variation within specific concepts, we employ local diversity mechanisms. The system generates "meta-prompts" — scenarios derived from taxonomy nodes — and then produces multiple distinct instantiations of that scenario. This prevents mode collapse, ensuring that a concept like "SQL injection" is represented through diverse framings rather than identical repetitions.
- Complexification: Complexity is treated as an orthogonal axis. We use a "complexification" step where a configurable fraction of meta-prompts is refined to be more elaborate or difficult. This allows practitioners to shift the difficulty distribution of a dataset without changing its semantic coverage.
- Quality Checks: To ensure correctness without human intervention, we employ a "dual-critic" loop that independently assesses if an answer is correct or incorrect. This dual-verification helps mitigate sycophancy (where models tend to agree with plausible-sounding outputs) and ensures high-quality labels.
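To make the separation between coverage and difficulty concrete, here is a small sketch of local diversification followed by complexification. It is only one plausible reading of the steps above: `make_meta_prompt`, `instantiate`, and `complexify` are hypothetical placeholders for model calls, and `complex_fraction` is an assumed name for the configurable fraction.

```python
import random


def make_meta_prompt(node: str) -> str:
    # Hypothetical stand-in for a model call that turns a taxonomy node into a scenario.
    return f"Write a question about {node}"


def instantiate(meta_prompt: str, n: int) -> list[str]:
    # Hypothetical stand-in: a real system would ask a model for n distinct framings.
    # With this toy list, outputs are distinct only for n <= 3.
    framings = ["as a case study", "as a multiple-choice item", "as a debugging task"]
    return [f"{meta_prompt}, framed {framings[i % len(framings)]}" for i in range(n)]


def complexify(meta_prompt: str) -> str:
    # Hypothetical stand-in for a model call that makes a scenario harder.
    return meta_prompt + " that requires multi-step reasoning"


def build_prompts(nodes: list[str], n_per_node: int, complex_fraction: float, seed: int = 0):
    """Create n_per_node prompts per taxonomy node; complexify a fraction of meta-prompts."""
    rng = random.Random(seed)
    meta_prompts = [make_meta_prompt(node) for node in nodes]
    k = round(complex_fraction * len(meta_prompts))
    chosen = set(rng.sample(range(len(meta_prompts)), k))
    records = []
    for i, meta_prompt in enumerate(meta_prompts):
        is_complex = i in chosen
        if is_complex:
            meta_prompt = complexify(meta_prompt)
        for text in instantiate(meta_prompt, n_per_node):
            records.append({"node": nodes[i], "complex": is_complex, "prompt": text})
    return records


nodes = ["SQL injection", "XSS", "Phishing", "Malware"]
records = build_prompts(nodes, n_per_node=3, complex_fraction=0.5)
print(len(records), sum(r["complex"] for r in records))
```

Changing `complex_fraction` alters how many records are marked complex but leaves the set of covered taxonomy nodes untouched, which is the sense in which complexity is an orthogonal axis.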
Simula frames synthetic data creation as a mechanism design problem, decomposing the process into distinct, controllable axes. First, Global Diversification leverages taxonomies to ensure broad domain coverage. Second, Local Diversification uses 1-of-N meta-prompting to instantiate distinct scenarios and prevent mode collapse. Third, Complexification optionally refines these scenarios to elevate difficulty and detail. Finally, Quality Checks utilize a dual-critic loop to verify that all outputs meet semantic and structural constraints.
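The dual-critic quality check can be illustrated with a short sketch. It assumes one possible interpretation of the description: the critic is asked two separate questions ("is this correct?" and "is this incorrect?"), and a sample is kept only when the two verdicts are consistent. The `Judge` type, the prompt wording, and both toy judges are hypothetical and exist only for this example.

```python
from typing import Callable

# Hypothetical interface: maps a yes/no question to the critic model's verdict.
Judge = Callable[[str], bool]


def dual_critic_accept(question: str, answer: str, judge: Judge) -> bool:
    """Keep a sample only if the critic affirms correctness AND denies incorrectness."""
    context = f"Q: {question}\nA: {answer}\n"
    says_correct = judge(context + "Is this answer correct?")
    says_incorrect = judge(context + "Is this answer incorrect?")
    return says_correct and not says_incorrect


def sycophantic_judge(prompt: str) -> bool:
    # A critic that agrees with whatever it is asked.
    return True


def toy_judge(prompt: str) -> bool:
    # A toy critic that "knows" the right answer is 4.
    answer_is_right = "A: 4\n" in prompt
    asks_incorrect = prompt.endswith("incorrect?")
    return answer_is_right != asks_incorrect


print(dual_critic_accept("What is 2 + 2?", "4", toy_judge))
print(dual_critic_accept("What is 2 + 2?", "5", toy_judge))
print(dual_critic_accept("What is 2 + 2?", "5", sycophantic_judge))
```

The third call shows why asking both questions helps: a critic that always says "yes" contradicts itself across the two prompts, so its approval is discarded rather than trusted.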
Addressing challenges in evaluation
The evaluation of synthetic data is fundamentally challenging due to the ambiguity of its core objectives and the disconnect between standard metrics and practical utility. Standard metrics like embedding-based cosine distance provide a high-level signal but offer limited actionable insights.
To make evaluations more robust, we apply our reasoning-first approach here as well. Specifically, we introduce reasoning-based metrics — Taxonomic Coverage and Calibrated Complexity Scoring (which uses LLM-driven batch comparisons to assign chess-style "Elo ratings" to individual data points) — to better capture the nuances of diversity and difficulty.
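As a rough illustration of the two metrics, the sketch below computes a simple coverage ratio and turns a batch ranking into Elo-style ratings. The details are assumptions rather than the paper's procedure: coverage is taken here as the share of taxonomy leaves hit by at least one data point, the ranking is assumed to come from a hypothetical LLM batch comparison (hardest first), and the K-factor of 32 and the 400-point scale are the conventional chess Elo constants.

```python
from itertools import combinations


def taxonomic_coverage(assigned_nodes: list[str], taxonomy_leaves: list[str]) -> float:
    """Share of taxonomy leaves that at least one data point was assigned to."""
    leaf_set = set(taxonomy_leaves)
    return len(leaf_set & set(assigned_nodes)) / len(leaf_set)


def elo_update(r_winner: float, r_loser: float, k: float = 32.0) -> tuple[float, float]:
    """Standard Elo update after the 'winner' is judged more complex than the 'loser'."""
    expected_win = 1.0 / (1.0 + 10 ** ((r_loser - r_winner) / 400.0))
    delta = k * (1.0 - expected_win)
    return r_winner + delta, r_loser - delta


def rate_batch(ratings: dict[str, float], ranked_ids: list[str], k: float = 32.0) -> dict[str, float]:
    """Treat a batch ranking (hardest first) as a set of pairwise wins and update ratings."""
    for harder, easier in combinations(ranked_ids, 2):
        ratings[harder], ratings[easier] = elo_update(ratings[harder], ratings[easier], k)
    return ratings


coverage = taxonomic_coverage(["x", "x", "y"], ["x", "y", "z", "w"])
ratings = rate_batch({"a": 1000.0, "b": 1000.0, "c": 1000.0}, ranked_ids=["c", "a", "b"])
print(coverage, sorted(ratings, key=ratings.get, reverse=True))
```

Because every update moves the same number of points from one item to the other, the total rating mass stays constant, so scores remain comparable across batches as long as all items start from the same baseline.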
No universal solution
We used Gemini 2.5 Flash as a teacher model and Gemma-3 4B as a student to evaluate Simula across five diverse domains — from cybersecurity (CTI-MCQ, CTI-RCM from CTIBench) and legal reasoning (LEXam), to standard AI model evaluations such as grade-school math (GSM8k) and multilingual academic knowledge (Global MMLU). Generating datasets of up to 512K data points for each domain, our results highlight a critical reality: there is no single "optimal" way to generate data, and the relationship between "good" data and downstream performance is deeply idiosyncratic.
- Mechanism design is non-negotiable: Across all domains, the full Simula system, which combines global coverage, local diversity, and critiquing, consistently outperformed simpler baselines.
- Context is king: There are no fixed recipes. While high complexity yielded a 10% accuracy gain in math reasoning (GSM8k), it actually hurt performance in legal reasoning (LEXam) where the teacher model was weaker. Data must be tailored to the capabilities of the model consuming it.
- Quality is the new quantity: Better data scales better. Simula achieved higher downstream performance with fewer samples compared to baseline approaches, confirming that scaling laws are driven by data properties, not just volume.
While this was a distillation setup, chosen for replicable, systemic evaluation, the core lessons learned extend beyond this specific configuration.
Downstream performance on different datasets.
From research to real-world impact
Simula was not built just to optimize benchmarks; it serves as a foundational data engine for real-world, business-critical applications across Google. Within the frontier AI space, it has been a key enabler for the Gemma ecosystem (including specialized models like ShieldGemma, FunctionGemma, and MedGemma) while providing the primary synthetic data backbone for both on-device and server-side Gemini safety classifiers. Beyond foundation models, Simula has been instrumental in shipping user protection features, including AI-powered scam detection for Android calls and spam filtering in Google Messages. Furthermore, Simula is actively driving new applied research, facilitating frameworks that democratize ML for enterprise security by synthesizing realistic attack scenarios, and enabling breakthroughs like teaching AI models to read maps through structured, reasoning-driven dataset generation.
Synthetic data's central role in specialized AI
AI progress is at a junction. The specialized data required for the next wave of breakthroughs in science, security, and law is unlikely to be generated by humans at the necessary scale. Synthetic data is primed to play a central role in these leaps, but only if approached with rigor. Ultimately, Simula's value lies in demonstrating how mechanism design can make data generation a controllable science. This blueprint provides a clear path to building the high-fidelity datasets the next era of AI demands, whether we are distilling knowledge into edge devices, training agents via reinforcement learning, or systematically exploring complex edge cases.
Acknowledgements
This research was authored by Tim R. Davidson, Benoit Seguin, Enrico Bacis, Cesar Ilharco, and Hamza Harkous. The Simula framework was founded and led by Hamza and Benoit. Special thanks go to Tim for his significant contributions during his student researcher tenure. We also thank Jan Keller for his TPM support and Coran Corbett and Ninny Wan for their vital technical and product partnerships. Finally, we thank Nina Taft, Amanda Walker, and Pankaj Rohatgi for their sponsorship and support.