Stability AI Releases Stable Audio 3: A Family of Fast Latent Diffusion Models for Audio Generation and Editing

Stability AI has released Stable Audio 3, a family of latent diffusion models for generating and editing stereo audio at 44.1 kHz. The models come in three scales (small, medium, large), with open weights for small and medium. Key innovations include a highly compressed SAME autoencoder, variable-length generation, and a three-stage training pipeline combining flow matching, distillation, and adversarial post-training. The models achieve competitive FAD and CLAP scores on music and sound effects benchmarks while supporting inpainting-based audio editing.

Article intelligence

Engineers · Advanced

Key points

  • Stable Audio 3 generates stereo audio at 44.1 kHz with variable-length outputs and supports inpainting-based editing.
  • The models are available in three scales: small (music or SFX), medium (both), and large (enterprise). Open weights are provided for small and medium.
  • Key technical innovations include the SAME autoencoder with 4096x compression and a three-stage training pipeline (flow matching, distillation warmup, adversarial post-training).
  • Stable Audio 3 achieves competitive FAD and CLAP scores on music and sound effects benchmarks and does not require classifier-free guidance at inference.

Why it matters

Stable Audio 3 combines 44.1 kHz stereo generation, variable-length outputs, and inpainting-based editing in one model family, with open weights for the small and medium scales and inference cost that scales with output duration.

Technical impact

The release may affect model selection, inference cost, product capability, and evaluation benchmarks.

Stability AI has released open weights for the small and medium Stable Audio 3 models along with a technical research paper. Stable Audio 3 is a family of latent diffusion models that generate stereo audio at 44.1 kHz. The models support variable-length outputs, inpainting-based editing, and fast inference.

What Is Stable Audio 3?

Stable Audio 3 is a family of latent diffusion models released at three scales: small, medium, and large. A latent diffusion model generates audio by learning to progressively remove noise from a compressed representation of audio, called a latent. It learns this mapping from noise to data by training on many pairs of clean latents and their noised versions; an autoencoder then decodes the generated latent back into a waveform.

The three model scales differ in capacity and maximum generation length. All parameter counts below are for the diffusion transformer component only. Each model also includes a SAME autoencoder (108M parameters for SAME-S, 852M for SAME-L).

small-music — 459M diffusion transformer parameters, up to 2 minutes, music only.

small-sfx — 459M diffusion transformer parameters, up to 2 minutes, sound effects only.

medium — 1.4B diffusion transformer parameters, up to 6 minutes and 20 seconds, music and sound effects.

large — 2.7B diffusion transformer parameters, up to 6 minutes and 20 seconds, music and sound effects.

Open weights for small and medium are available on Hugging Face. Large is available under an enterprise license.

Architecture: Two Components

Stable Audio 3 has two main components: a semantic-acoustic autoencoder called SAME, and a diffusion transformer that generates latent sequences conditioned on text, duration, and inpainting masks.

https://arxiv.org/pdf/2605.17991

The SAME Autoencoder

SAME (Semantically-Aligned Music autoEncoder) converts stereo 44.1 kHz audio into a compact latent representation and back. Its key design parameter is a 4096× downsampling ratio — substantially higher than the 1024× to 2048× ratios common in prior audio autoencoders. This higher ratio reduces latent sequence lengths enough for long-form generation to run on consumer hardware.

SAME achieves its 4096× compression through two stages. First, a patching stage reshapes stereo audio into non-overlapping patches of 256 samples per channel, achieving 256× downsampling. Second, a Transformer Resampling Block (TRB) applies a further 16× downsampling using learnable output embeddings interleaved with the input sequence, processed through a transformer. The combined output is a 256-dimensional latent sequence at approximately 10.77 Hz (44,100 / 4,096) for a 44.1 kHz input.
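The two downsampling stages multiply, which makes the latent frame rate and sequence lengths easy to check. The sketch below is plain arithmetic on the figures quoted above; rounding the frame count up (as if the final partial frame were padded) is an assumption, and `latent_frames` is a hypothetical helper name.

```python
import math

SAMPLE_RATE_HZ = 44_100
PATCH_SIZE = 256   # samples per channel per patch (stage 1)
TRB_FACTOR = 16    # extra downsampling in the Transformer Resampling Block (stage 2)

TOTAL_FACTOR = PATCH_SIZE * TRB_FACTOR            # combined downsampling ratio
LATENT_RATE_HZ = SAMPLE_RATE_HZ / TOTAL_FACTOR    # latent frames per second


def latent_frames(duration_s: float) -> int:
    """Latent frames needed to cover duration_s seconds.

    Rounding up is an assumption (it treats the last partial frame as padded).
    """
    samples = duration_s * SAMPLE_RATE_HZ
    return math.ceil(samples / TOTAL_FACTOR)


for seconds in (20, 120, 380):
    print(f"{seconds:>4} s of audio -> {latent_frames(seconds)} latent frames")
```

Even the longest supported output (6 minutes 20 seconds) stays at a few thousand latent frames, which is what keeps long-form generation tractable for the transformer.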

The SAME autoencoder is trained with five loss types: spectral reconstruction, adversarial, diffusion alignment, semantic regression (predicting chroma and interaural level difference), and contrastive latent alignment. These losses push the latent to preserve both acoustic reconstruction quality and semantic structure. A soft-normalisation bottleneck constrains the scale of the latent, providing deterministic encoding.

The SAME autoencoder is frozen during diffusion training. Small models use SAME-S (108M parameters, optimized for CPU inference); medium and large use SAME-L (852M parameters).

The Diffusion Transformer

The diffusion transformer operates on SAME latents. Conditioning enters through three pathways:

Text — a frozen T5Gemma encoder produces a sequence of 256 embeddings of dimension 768. Short prompts are padded to 256 with a learned embedding; long prompts are truncated.

Duration — encoded as a Fourier features vector and injected via both Adaptive Layer Normalization (AdaLN) and cross-attention alongside the text prompt.

Inpainting — a binary mask concatenated with the masked reference audio is projected through a 2-layer MLP and added to the residual stream of each transformer block.
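As a rough illustration of the text and duration pathways, the sketch below pads or truncates a prompt sequence to 256 items and encodes a duration as sine and cosine features. The right-side padding, the number of frequencies, the doubling frequency ladder, and the normalisation by 380 seconds are all assumptions made for the example; the article does not specify them. Both function names are hypothetical.

```python
import math

TEXT_LEN = 256  # fixed number of text embeddings, as stated in the article


def pad_or_truncate(tokens, pad_token, length=TEXT_LEN):
    """Truncate long sequences and right-pad short ones to exactly `length` items.

    Each item stands in for one 768-dimensional embedding; pad_token stands in
    for the learned padding embedding. Padding on the right is an assumption.
    """
    clipped = list(tokens[:length])
    return clipped + [pad_token] * (length - len(clipped))


def fourier_features(duration_s, num_freqs=4, max_duration_s=380.0):
    """Encode a scalar duration as [sin, cos] pairs at doubling frequencies.

    num_freqs, the frequency ladder, and max_duration_s are illustrative
    assumptions, not values from the article.
    """
    x = duration_s / max_duration_s  # normalise to roughly [0, 1]
    feats = []
    for k in range(num_freqs):
        angle = 2.0 * math.pi * (2 ** k) * x
        feats.extend([math.sin(angle), math.cos(angle)])
    return feats


prompt = pad_or_truncate(["warm", "piano", "loop"], "<pad>")
print(len(prompt), len(fourier_features(20.0)))
```

In the real model the resulting duration vector would feed both AdaLN and cross-attention; the sketch only covers how the inputs could be shaped.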

Each transformer block contains self-attention, cross-attention, local-additive conditioning for inpainting, and a SwiGLU feed-forward network. Medium and large use differential attention, which computes two separate attention maps using two (Q, K) pairs sharing one set of values V, then subtracts one map from the other. This cancels attention patterns that are common to both maps. The transformer prepends 64 learnable memory embeddings before processing each sequence. These provide a global context buffer that every position can attend to, and are removed before computing any loss.
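Differential attention as described above can be sketched in a few lines of pure Python. This is a minimal single-head illustration on nested lists, not the model's implementation. The `lam` scale on the second map is an assumption borrowed from the original Differential Transformer formulation, where it is a learnable scalar; the plain subtraction described in the text corresponds to `lam=1.0`.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_map(Q, K):
    """Row-wise softmax of scaled dot products between queries and keys."""
    d = len(Q[0])
    return [
        softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K])
        for q in Q
    ]


def differential_attention(Q1, K1, Q2, K2, V, lam=1.0):
    """Subtract the second attention map from the first, then apply it to shared V."""
    A1 = attention_map(Q1, K1)
    A2 = attention_map(Q2, K2)
    diff = [[a - lam * b for a, b in zip(r1, r2)] for r1, r2 in zip(A1, A2)]
    out = [
        [sum(w * v[j] for w, v in zip(row, V)) for j in range(len(V[0]))]
        for row in diff
    ]
    return diff, out


Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
diff, out = differential_attention(Q, K, Q, K, V)
print(diff)  # identical maps cancel completely
```

When both (Q, K) pairs produce the same map, the difference is zero everywhere, which is the "common pattern" cancellation in its most extreme form.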

Variable-Length Generation

Most prior latent diffusion models for audio operate at a fixed maximum sequence length. Generating a short clip still requires running inference at full length, wasting compute on silence. Stable Audio 3 is trained to generate audio at variable lengths natively, using three mechanisms:

Variable-length flash attention and masked loss — sequences shorter than the batch maximum are right-padded in latent space. Padding positions are excluded from self-attention and from the loss.

Per-element timestep shifts — longer sequences retain more structure at a given noise level due to redundancy between neighboring elements. To compensate, the noise schedule is shifted toward higher noise levels for longer sequences during training, using a logistic shift parameterized by µ (interpolating between µmin=0.5 and µmax=1.15 based on sequence length).

Silence augmentation — the signal region is randomly extended with pre-computed silence embeddings drawn from an exponential distribution, averaging 4 seconds. This teaches the model to terminate audio with natural silence.
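The length-dependent timestep shift and the silence augmentation can be sketched as follows. The article gives only the endpoints µmin=0.5 and µmax=1.15 and says the shift is logistic; the linear interpolation over length, the specific form sigmoid(µ + logit(t)), and the convention that t=1 is pure noise are assumptions here. `min_frames` and `max_frames` are hypothetical parameters.

```python
import math
import random

MU_MIN, MU_MAX = 0.5, 1.15  # endpoints quoted in the article


def mu_for_length(n_frames, min_frames, max_frames):
    """Interpolate mu by sequence length (linear interpolation is an assumption)."""
    frac = (n_frames - min_frames) / (max_frames - min_frames)
    frac = min(1.0, max(0.0, frac))
    return MU_MIN + frac * (MU_MAX - MU_MIN)


def shift_timestep(t, mu):
    """Logistic shift sigmoid(mu + logit(t)) for t in (0, 1).

    With t = 1 meaning pure noise (assumed convention), mu > 0 moves t
    toward the noisy end; larger mu moves it further.
    """
    logit = math.log(t) - math.log1p(-t)
    return 1.0 / (1.0 + math.exp(-(mu + logit)))


def silence_extension_seconds(rng, mean_s=4.0):
    """Draw a silence length from an exponential distribution with the given mean."""
    return rng.expovariate(1.0 / mean_s)


rng = random.Random(0)
for frames in (216, 4092):
    mu = mu_for_length(frames, 216, 4092)
    print(frames, round(mu, 3), round(shift_timestep(0.5, mu), 3))
print(round(silence_extension_seconds(rng), 2), "s of appended silence")
```

Under these assumptions a mid-schedule timestep of 0.5 is pushed further toward noise for the longer sequence, matching the stated goal of compensating for redundancy between neighboring latent frames.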

The practical result is that inference cost scales with output duration. Medium generates 20 seconds of audio in approximately 0.62 seconds on an H200. Generating 380 seconds takes 1.31 seconds on the same hardware.
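The two timings quoted for medium give a quick feel for how cost scales. The sketch below computes the real-time factor for each and fits a straight line through the two points; treating the relationship as linear from only two measurements is an assumption made for illustration.

```python
def realtime_factor(audio_s, wall_s):
    """Seconds of audio produced per second of wall-clock time."""
    return audio_s / wall_s


# (audio seconds, wall-clock seconds) for medium on an H200, from the article
short_clip = (20.0, 0.62)
long_clip = (380.0, 1.31)

rtf_short = realtime_factor(*short_clip)
rtf_long = realtime_factor(*long_clip)

# Two-point linear fit: wall_s = fixed_s + per_audio_second * audio_s (assumed linear)
per_audio_second = (long_clip[1] - short_clip[1]) / (long_clip[0] - short_clip[0])
fixed_s = short_clip[1] - per_audio_second * short_clip[0]

print(f"real-time factor: {rtf_short:.1f}x (20 s), {rtf_long:.1f}x (380 s)")
print(f"marginal cost: {per_audio_second * 1000:.2f} ms per extra second of audio")
print(f"implied fixed overhead: {fixed_s:.2f} s")
```

Read this way, most of the time for a short clip is fixed overhead, and each additional second of audio adds only a couple of milliseconds.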

Three-Stage Training Pipeline

Stage 1: Flow Matching Pre-Training. The model learns a velocity field that transports Gaussian noise toward audio latents. Training uses minibatch optimal transport coupling via Sinkhorn iterations, which pairs each data sample with the closest available noise vector in the batch. This straightens training trajectories and reduces crossing transport paths. Inpainting is trained jointly throughout: at each step, one of three mask types is sampled: a full mask (80%, equivalent to generation with no reference audio, so the model is unconditional with respect to the inpainting input), random segment masks (10%), or a causal prefix mask for continuation (10%).
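The mask sampling in Stage 1 can be sketched as a small sampler. The 80/10/10 split comes from the article; everything else is an assumption for illustration: the convention that 1 marks a frame to be generated and 0 a frame of kept reference audio, the use of a single random span for the segment case, and the uniform choice of prefix length. `sample_inpainting_mask` is a hypothetical helper.

```python
import random

MASK_TYPES = ("full", "segments", "prefix")
MASK_PROBS = (0.8, 0.1, 0.1)  # sampling probabilities quoted in the article


def sample_inpainting_mask(n_frames, rng):
    """Return (kind, mask) for a sequence of n_frames >= 2 latent frames.

    Assumed convention: mask[i] == 1 means the model must generate frame i,
    mask[i] == 0 means reference audio is provided for frame i.
    """
    kind = rng.choices(MASK_TYPES, weights=MASK_PROBS, k=1)[0]
    if kind == "full":
        # Nothing is kept: ordinary generation with no reference audio.
        return kind, [1] * n_frames
    if kind == "prefix":
        # Keep a prefix as context and generate the continuation.
        keep = rng.randint(1, n_frames - 1)
        return kind, [0] * keep + [1] * (n_frames - keep)
    # "segments": regenerate one random interior span (a single span for brevity).
    start = rng.randint(0, n_frames - 2)
    end = rng.randint(start + 1, n_frames - 1)  # exclusive end
    mask = [0] * n_frames
    for i in range(start, end):
        mask[i] = 1
    return kind, mask


rng = random.Random(0)
for _ in range(3):
    kind, mask = sample_inpainting_mask(16, rng)
    print(f"{kind:>8}: {''.join(map(str, mask))}")
```

Because the full mask dominates, most training steps look like plain text-to-audio generation, with editing and continuation learned from the remaining 20%.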

Stage 2 — Distillation Warmup. A frozen copy of the flow matching model (teacher) generates 15-step DPM++ trajectories with CFG scale 5. The student is trained for 10,000 steps to map any intermediate noisy state directly to the teacher’s final denoised output in one step, using an MSE loss. This collapses the multi-step ODE into a single-step denoiser. The trade-off is that MSE regression produces outputs that regress toward the conditional mean, reducing fine-grained detail.

Stage 3 — Adversarial Post-Training. This stage replaces the MSE objective with a relativistic adversarial setup. A discriminator (initialized from the base flow matching model) evaluates the student’s one-step denoised outputs directly against real data. The teacher is discarded entirely at this stage. The generator is trained with two losses: a relativistic adversarial loss (L_R) and a CLAP alignment loss (L_CLAP). The discriminator is trained with L_R and a contrastive loss (L_C) that penalizes the discriminator for ignoring text-audio alignment (it is trained to distinguish correctly paired audio-text pairs from shuffled ones). The adversarial setup allows the model to recover the perceptual sharpness that MSE distillation removes.
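The article does not give the formula for L_R. One common relativistic form, used here purely as an assumption, scores each generated sample relative to a paired real sample and passes the difference through a softplus. The sketch takes precomputed discriminator logits as plain lists; `relativistic_losses` is a hypothetical helper, and the CLAP and contrastive terms are omitted.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def relativistic_losses(d_real, d_fake):
    """Generator and discriminator losses from paired discriminator logits.

    Assumed form: the generator wants D(fake) - D(real) to be large,
    the discriminator wants D(real) - D(fake) to be large.
    """
    n = len(d_real)
    gen = sum(softplus(-(f - r)) for r, f in zip(d_real, d_fake)) / n
    disc = sum(softplus(-(r - f)) for r, f in zip(d_real, d_fake)) / n
    return gen, disc


gen_loss, disc_loss = relativistic_losses(d_real=[2.0, 1.5], d_fake=[-1.0, 0.5])
print(f"generator loss {gen_loss:.3f}, discriminator loss {disc_loss:.3f}")
```

In this form neither network is rewarded for absolute scores, only for the gap between real and generated samples, which is what "relativistic" refers to.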

Inference: Ping-Pong Sampling and No CFG

The post-trained model can generate audio in a single forward pass. However, single-step generation from pure noise remains difficult. Stable Audio 3 uses ping-pong sampling at inference: the model denoises to a clean estimate, then adds new noise at a reduced level, then denoises again. This repeats for 8 steps using a logSNR-uniform schedule (N+1 equally-spaced steps in the interval [λmin, λmax] = [−6.2, 2.0]). The iterative denoise-then-renoise schedule allows each step to correct errors from the previous step.
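A minimal sketch of the schedule and the denoise-then-renoise loop follows. It assumes the interpolation x_t = (1 - t) * x0 + t * noise, so that logSNR = 2 * log((1 - t) / t); the article states only the logSNR range and step count. It also assumes the last clean estimate is returned directly, which leaves the final of the N+1 levels unused. The `denoise` callable is a stand-in for the model, not a real API.

```python
import math
import random

LAMBDA_MIN, LAMBDA_MAX = -6.2, 2.0  # logSNR range quoted in the article


def logsnr_uniform_times(n_steps=8):
    """N+1 noise levels t, equally spaced in logSNR, ordered noisiest first.

    Assumes x_t = (1 - t) * x0 + t * noise, hence t = 1 / (1 + exp(logSNR / 2)).
    """
    step = (LAMBDA_MAX - LAMBDA_MIN) / n_steps
    lams = [LAMBDA_MIN + i * step for i in range(n_steps + 1)]
    return [1.0 / (1.0 + math.exp(lam / 2.0)) for lam in lams]


def ping_pong_sample(denoise, n_frames, rng, n_steps=8):
    """Alternate one-step denoising with re-noising at a lower noise level."""
    ts = logsnr_uniform_times(n_steps)
    x = [rng.gauss(0.0, 1.0) for _ in range(n_frames)]  # start from noise
    x0_hat = x
    for i in range(n_steps):
        x0_hat = denoise(x, ts[i])  # "ping": jump to a clean estimate
        if i < n_steps - 1:
            t_next = ts[i + 1]      # "pong": add fresh noise at the next level
            x = [(1.0 - t_next) * c + t_next * rng.gauss(0.0, 1.0) for c in x0_hat]
    return x0_hat


def toy_denoiser(x, t):
    """Placeholder model that always predicts the same clean signal."""
    return [0.25] * len(x)


print([round(t, 3) for t in logsnr_uniform_times()])
print(ping_pong_sample(toy_denoiser, 4, random.Random(0)))
```

Each pass sees a less noisy input than the one before, so a real denoiser gets eight chances to refine its estimate rather than one.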

Stable Audio 3 does not require classifier-free guidance (CFG) at inference. Diffusion models that use CFG run two forward passes per step, one conditional and one unconditional, and extrapolate from the unconditional prediction toward the conditional one by the guidance scale. Here, CFG quality gains are internalized during distillation warmup, where the student is trained to match CFG-enhanced teacher trajectories. Text-audio alignment is further reinforced through L_CLAP during adversarial post-training. This eliminates the two-pass-per-step cost of CFG.

Prompt formatting note: All Stable Audio 3 models trained on AudioSparx (small-music, medium, large) require prompt prefixes to function correctly. Music prompts should be prepended with "TrackType: Music, VocalType: Instrumental," and sound effects prompts with "TrackType: SFX,".
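A small helper makes the prefix requirement hard to forget. The prefix strings are copied from the note above; the function name, the `kind` argument, and the single space between prefix and description are assumptions for illustration.

```python
MUSIC_PREFIX = "TrackType: Music, VocalType: Instrumental,"
SFX_PREFIX = "TrackType: SFX,"


def format_prompt(description: str, kind: str) -> str:
    """Prepend the required prefix for a music or sound effects prompt.

    Hypothetical helper: joining prefix and description with one space
    is an assumption, not something the article specifies.
    """
    prefixes = {"music": MUSIC_PREFIX, "sfx": SFX_PREFIX}
    if kind not in prefixes:
        raise ValueError(f"kind must be one of {sorted(prefixes)}, got {kind!r}")
    return f"{prefixes[kind]} {description.strip()}"


print(format_prompt("warm lo-fi piano loop, 80 bpm", "music"))
print(format_prompt("heavy wooden door slamming shut", "sfx"))
```

The note lists small-music, medium, and large as the models that need these prefixes; it does not say whether small-sfx does, so check that model's documentation before applying the SFX prefix to it.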

Evaluation Results

Instrumental music (Song Describer Dataset, 120s). On FAD (lower is better) and CLAP score (higher is better), large achieves FAD 0.101 / CLAP 0.393. Medium achieves FAD 0.107 / CLAP 0.390. Stable Audio 2.5 (the internal prior-generation baseline) achieves FAD 0.106 / CLAP 0.395. In the listening test, medium and large score higher on musicality (MUS) than Stable Audio 2.5 (4.15 and 4.30 vs. 3.70 out of 5, respectively). Inference time for 120s audio on an H200: 0.45s for small, 0.78s for medium, 0.81s for large. Stable Audio 2.5 takes 0.85s for the same length.

Sound effects (BBC Sound Effects Dataset, 5s). Medium achieves FAD 0.369 / CLAP 0.369. The next-best open-weight baselines are Stable Audio Open Small (FAD 0.500 / CLAP 0.277) and Stable Audio Open (FAD 0.501 / CLAP 0.263). Woosh Flow scores FAD 0.580.

Audio editing (inpainting). The research team evaluates three inpainting settings: single region, two independent regions, and continuation. For music, medium achieves FAD-full of 0.046 on single inpainting and 0.046 on double inpainting. Large achieves 0.047 on both. For continuation, medium achieves FAD-full 0.074 and large achieves 0.071. Sound effects results follow a similar pattern; continuation shows higher FAD than inpainting in both domains, which the team attributes to the model having less surrounding audio context to anchor the generation.

Comparison

Model specs

Model | Developer | Released | Architecture | Parameters | Max length | Sample rate | Domain | Open weights | Inpainting

STABLE AUDIO LINEAGE
Stable Audio Open | Stability AI | Jul 2024 | Latent diffusion (DiT) | DiT 1057M + AE 156M + T5 109M | 47s | 44.1kHz stereo | Music + SFX | Yes | No
Stable Audio Open Small | Stability AI | 2024 | Latent diffusion (DiT) | Not published | 11s | 44.1kHz stereo | SFX | Yes | No
Stable Audio 2.5 | Stability AI | Internal | Latent diffusion (DiT) | Not published | 190s (3m 10s) | 44.1kHz stereo | Music | Not released | No
SA3 small-music ★ | Stability AI | May 2026 | Latent diffusion (SAME + DiT) | DT 459M + SAME-S 108M | 2m | 44.1kHz stereo | Music only | Yes | Yes
SA3 small-sfx ★ | Stability AI | May 2026 | Latent diffusion (SAME + DiT) | DT 459M + SAME-S 108M | 2m | 44.1kHz stereo | SFX only | Yes | Yes
SA3 medium ★ | Stability AI | May 2026 | Latent diffusion (SAME + DiT) | DT 1.4B + SAME-L 852M | 6m 20s | 44.1kHz stereo | Music + SFX | Yes | Yes
SA3 large ★ | Stability AI | May 2026 | Latent diffusion (SAME + DiT) | DT 2.7B + SAME-L 852M | 6m 20s | 44.1kHz stereo | Music + SFX | Enterprise | Yes

COMPETITORS
TangoFlux | SUTD / NVIDIA / Lambda | Dec 2024 | Flow matching (DiT + MMDiT) | 515M | 30s | 44.1kHz | SFX | Yes (Apache 2.0) | No
Woosh Flow | Sony AI | Apr 2026 | Flow matching | Not published | 5s | Not disclosed | SFX | Yes (MIT) | No
Woosh DFlow | Sony AI | Apr 2026 | Distilled flow matching | Not published | 5s | Not disclosed | SFX | Yes (MIT) | No
DiffRhythm 2 | ASLP Lab (NPU) | Oct 2025 | Block flow matching (semi-autoregressive) | Not published | 210s (3m 30s) | 48kHz output | Music + vocals | Yes | No
ACE-Step 1.5 | ACE Studio / StepFun | Jan 2026 | Hybrid LM (0.6B–4B) + DiT (up to 4B) | LM 0.6B–4B + XL DiT 4B | 10m | Not disclosed | Music + vocals + lyrics | Yes | No

★ SA3 rows: Parameter counts are for the diffusion transformer (DT) component only; SAME autoencoder pa
