2026-06-20站内改写5 min readUpdated: 2026-06-20

Diffusion‑based LLMs that generate many parallel tokens rather than one‑by‑one

Inception builds next-generation LLMs powered by diffusion, enabling parallel token generation for faster speed and lower cost. Their Mercury models (Mercury 2 for reasoning, Mercury Edit 2 for code) achieve dramatic latency and cost reductions, deployed at Fortune 500 companies.

SourceHacker News AIAuthor: binyu

Article intelligence

EngineersAdvanced

Key points

Inception uses diffusion models instead of autoregressive generation, allowing multiple tokens to be generated in parallel. This results in several times faster speed and less than half the cost.
Mercury 2 is the first reasoning diffusion LLM, and Mercury Edit 2 is a coding-focused dLLM optimized for low latency.
The technology is already deployed at Fortune 500 companies, with customers reporting 82% lower latency and 90% cost reduction for summarization tasks.
The team includes leading researchers from Stanford, Google DeepMind, OpenAI, and other top institutions, with breakthroughs in diffusion models, Flash Attention, and DPO.

Why it matters

This matters because inception uses diffusion models instead of autoregressive generation, allowing multiple tokens to be generated in parallel. This results in several times faster speed and less than half the cost.

Technical impact

May affect model selection, inference cost, product capability, and evaluation benchmarks.

Inception – A new frontier in LLM speed

Mercury 2 and the Rise of Real-time Subagents

Learn more

Mercury 2 and the Rise of Real-time Subagents

A new frontier in LLM speed

A new frontier in LLMs speed

A new frontier in LLM speed

Inception’s breakthrough diffusion-based approach to language generation enables the world’s fastest, most efficient AI models with best-in-class quality.

Inception builds and deploys next‑generation large language models (LLMs) that are powered by diffusion rather than traditional auto‑regressive generation. By using diffusion, their models can produce many tokens in parallel, making them several times faster and less than half the cost of conventional LLMs. The diffusion framework also provides fine‑grained control over outputs, allowing adherence to specific schemas and semantic constraints. Additionally, it offers a unified paradigm for combining language with other data modalities such as audio, images, and video. The company’s team includes leading researchers and engineers from Stanford, UCLA, Cornell, Google DeepMind, Meta AI, Microsoft AI, and OpenAI, and they are currently deploying these diffusion LLMs at Fortune 500 companies.

Explain what Inception does

Here are some prompts you can try with a diffusion-style LLM:

Explain a complex topic step by step, showing intermediate reasoning.
Generate multiple variations of a product tagline and refine them progressively.
Write a short story that improves its wording over several iterations.
Brainstorm startup ideas and evolve the best one through revisions.
Refactor a piece of code and show incremental improvements.
Describe an image concept and refine the details in stages.
Compare two technologies with increasingly deeper analysis.
Draft a landing page headline and iterate toward a clearer version.
Simulate a design critique that becomes more precise each step.
Turn rough notes into a polished summary through gradual refinement.

Suggest 10 prompts for a dLLM

Here are some prompts you can try with a diffusion-style LLM:

Explain a complex topic step by step, showing intermediate reasoning.
Generate multiple variations of a product tagline and refine them progressively.
Write a short story that improves its wording over several iterations.
Brainstorm startup ideas and evolve the best one through revisions.
Refactor a piece of code and show incremental improvements.
Describe an image concept and refine the details in stages.
Compare two technologies with increasingly deeper analysis.
Draft a landing page headline and iterate toward a clearer version.
Simulate a design critique that becomes more precise each step.
Turn rough notes into a polished summary through gradual refinement.

Suggest 10 prompts for a dLLM

Create a Javascript animation

Trusted by teams at

The Mercury diffusion models introduce blazing fast inference with frontier quality at a fraction of the cost of other top-tier models.

Read our research

Speed Benchmark

Tokens/sec

Speed Benchmark

Tokens/sec

The diffusion difference. From sequential to parallel

All other LLMs generate text one token at a time. Mercury diffusion LLMs (dLLMs) generate tokens in parallel, increasing speed and maximizing GPU efficiency.

Parallel Generation

Mercury

zap

mango

crisp

lunar

wobble

spin

felt

droop

echo

Sequential Generation

ChatGPT

The

Quick

Brown

Fox

Jumps

Over

The

Lazy

Dog

Parallel Generation

Mercury

zap

mango

crisp

lunar

wobble

spin

felt

droop

echo

Sequential Generation

ChatGPT

The

Quick

Brown

Fox

Jumps

Over

The

Lazy

Dog

Blazing-fast performance you can notice

Write code

Real-Time Voice

Instant Agents

Write code

Real-Time Voice

Instant Agents

Build the future of AI apps with Mercury

Get Started

Lightning fast agents

Automate complex coding and other business workflows with with ultra-responsive AI.

Real-time voice

Engage naturally with AI in voice-powered workflows like customer support, translation, and immersive gaming.

Instant code editing

Stay in-the-flow with responsive autocomplete, intelligent tab suggestions, and fast chat responses.

Fast, creative co-pilots

Supercharge editorial and creative work—less waiting, more creating.

Rapid search

Instantly surface the right data from across your organization’s knowledge base.

Foundational models

Meet our family of diffusion models

Mercury 2

Get Started

Docs

The fastest reasoning LLM and the first reasoning dLLM. Ideal for complex applications where performance and speed are crucial.

Input $0.25 per 1M tokens

Output $0.75 per 1M tokens

Mercury 2

The fastest reasoning LLM and the first reasoning dLLM. Ideal for complex applications where performance and speed are crucial.

Input $0.25 per 1M tokens

Output $0.75 per 1M tokens

Early Access

Read API Docs

Mercury Edit 2

Get Started

Docs

A small, coding-focused dLLM. Ideal for code editing and other extremely latency-sensitive components of coding workflows.

Input $0.25 per 1M tokens

Output $0.75 per 1M tokens

Mercury Edit 2

A small, coding-focused dLLM. Ideal for code editing and other extremely latency-sensitive components of coding workflows.

Input $0.25 per 1M tokens

Output $0.75 per 1M tokens

Early Access

Read API Docs

Research

Led by visionary AI researchers

Our founders pioneered diffusion modeling and invented cornerstone AI technologies.

Our Research

Diffusion Models

Read paper

The underlying approach for modern image and video generation, powering applications including Sora and MidJourney.

Flash Attention

Read paper

A key algorithm for efficient GPU utilization in LLM training and inference.

Direct Preference Optimization

Read paper

One of the core approaches for aligning LLMs with human feedback.

Loved by leaders and innovators

Book a Demo

Because Mercury 2 delivers the perfect threshold of intelligence at lightning speeds, the equation heavily works in our favor. We cut summarization latency by 82% and dropped costs by 90%.

Ankur Rustagi & John Mu

Because Mercury 2 delivers the perfect threshold of intelligence at lightning speeds, the equation heavily works in our favor. We cut summarization latency by 82% and dropped costs by 90%.

Ankur Rustagi & John Mu

After trying Mercury, it's hard to go back. We are excited to roll out Mercury to support all of our voice agents.

Oliver Silverstein, CEO

After trying Mercury, it's hard to go back. We are excited to roll out Mercury to support all of our voice agents.

Oliver Silverstein, CEO

Speed in a code editor isn't a nice-to-have. It's the difference between staying in flow and losing your train of thought. Mercury completions land fast enough to feel like part of the developer's own thinking, not an interruption to it.

Max Brunsfeld, Co-founder

Enterprise-grade privacy and reliability

We’re available through major cloud providers like AWS Bedrock and Azure Foundry. Talk with us about fine-tuning and private deployments.

Talk to Sales

Integrate in seconds

Our models are OpenAI API compatible and a drop-in replacement for traditional LLMs.

Enterprise AI partner

We’re available through major cloud providers like AWS Bedrock and Azure Foundry.

Reliability at scale

Get 99.5%+ uptime and priority support with custom SLAs.

The future of LLMs is here

Get Started

The future of LLMs is here

Get Started

Products

Get Started

Models

Pricing

Company

About Us

Research

Careers

Blog

Resources

Mercury Chat

API Platform

Documentation

Integrations

Partners

Legal

Contact

Sales

Inquires

Discord

Products

Get Started

Models

Pricing

Company

About Us

Research

Careers

Blog

Resources

Mercury Chat

API Platform

Documentation

Integrations

Partners

Legal

Contact

Sales

Inquires

Discord