2026-05-15 02:24 UTC · In-site rewrite · 6 min read · Updated: 2026-06-27 00:25 UTC

Scaling and Optimizing Frontier Model Training

Fireworks' blog post details how its Training SDK and optimizations (low-precision quantization, optimizer offloading, composable parallelism, Blackwell-native precision, and streaming pipeline parallelism) scale trillion-parameter MoE model training, supporting both LoRA and full-parameter modes across a wide model catalog.

Source: Fireworks AI Blog



How Fireworks scales frontier model training and offers the broadest set of fine-tunable MoE models on any platform.

PUBLISHED 4/3/2026


From RL Rollouts to the Training Engine

Training trillion-parameter Mixture-of-Experts (MoE) models has historically been bottlenecked by memory walls and complex cluster orchestration. Earlier this month, Cursor released Composer 2 — a frontier coding model that tops CursorBench at 61.3, SWE-bench Multilingual at 73.7, and Terminal-Bench at 61.7. Fireworks powers the Reinforcement Learning (RL) inference infrastructure behind it, proving that these bottlenecks can be overcome at scale.

We have written about delta-compressed weight sync and multi-region rollout fleets, and about why numerical parity between training and inference is especially hard for MoE models. Those posts cover the inference half of the RL loop — rollouts, weight transfer, and numerical alignment.

This post covers the last missing piece: the trainer itself. Our Training SDK provides the model catalog, parallelism stack, precision kernels, and memory optimizations that make it possible to fine-tune trillion-parameter MoE models on current hardware.

What's Available Today

Our Training Shapes catalog supports both LoRA and full-parameter training across models in the Fireworks catalog. Customers pick a shape ID and call resolve_training_profile(); the Training SDK and API backend handle GPU layout, parallelism, and deployment bring-up automatically. Teams that want to start with managed fine-tuning and graduate to custom training loops can do so on the same platform.

| Model | Architecture | Context | Hardware |
|---|---|---|---|
| Qwen3.5 397B-A17B | MoE | 262K | 32x B200 |
| Qwen3.5 35B-A3B | MoE | 262K | 8x B200 |
| Qwen3 235B | MoE | 128K | 16x B200 |
| Qwen3 32B | Dense | 65K | 8x B200 |
| Qwen3 30B-A3B | MoE | 128K | 8x B200 |
| Qwen3 8B | Dense | 128K | 4x B200 |
| Qwen3 VL 8B | Dense (Vision-Language) | 65K | 4x H200 |
| Kimi K2.5 | MoE | 256K | 64x B200 |
| Kimi K2.5 LoRA | MoE | 256K | 8x B300 |
| Llama 3.3 70B | Dense | 128K | 8x B200 |
| MiniMax M2.5 | MoE | 192K | 16x B200 |
| Nemotron 3 Super 120B | Hybrid Mamba-MoE | 128K | 16x B200 |
| Nemotron Nano 3 30B-A3B | Hybrid Mamba-MoE | 262K | 8x B200 |

Both policy trainer and forward-only reference shapes are available for every model, supporting full RL workflows with separate policy and reference deployments. This is, to our knowledge, the broadest set of fine-tunable frontier MoE models available on any training platform.

The two training modes present very different engineering challenges. LoRA freezes most of the model and updates a small set of low-rank adapters — the question is whether the full model even fits on a single node. Full-parameter training updates every weight — the question is how to distribute a trillion parameters, their gradients, their optimizer states, and their activations across a GPU cluster while keeping utilization high. We built the engine to handle both.

LoRA: Fitting a Trillion Parameters on One Node

LoRA fine-tuning of a 1T MoE model sounds like it should be easy — only a fraction of parameters are trainable. But the frozen base model still has to live in GPU memory. Kimi K2.5 has 384 MoE experts; in bfloat16, those experts alone consume the majority of an 8-GPU node's memory before a single gradient is computed.

Low-precision expert quantization makes it fit. We store frozen expert weights in a reduced-precision packed format, cutting expert memory by roughly 4x. The experts are dequantized to bf16 on the fly during the forward pass; because they are frozen, there is no loss of gradient precision. For Kimi K2.5, this is the difference between needing multiple nodes and fitting on a single 8-GPU node.
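To make the memory arithmetic concrete, here is a minimal sketch of packing one group of frozen weights into 4-bit codes with a shared scale and dequantizing them on demand. The 4-bit width is an assumption inferred from the roughly 4x reduction relative to bf16 (16 bits per weight); the post does not name the actual packed format. `quantize_group` and `dequantize_group` are hypothetical helper names, not SDK functions.

```python
from typing import List, Tuple


def quantize_group(weights: List[float]) -> Tuple[bytes, float]:
    """Symmetric 4-bit quantization of one weight group.

    Returns two codes packed per byte plus the shared scale.
    """
    if len(weights) % 2 != 0:
        raise ValueError("group length must be even to pack two codes per byte")
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    # Round to the nearest integer code in [-8, 7], then keep the low 4 bits.
    codes = [max(-8, min(7, round(w / scale))) & 0xF for w in weights]
    packed = bytes((codes[i] << 4) | codes[i + 1] for i in range(0, len(codes), 2))
    return packed, scale


def dequantize_group(packed: bytes, scale: float) -> List[float]:
    """Unpack the 4-bit codes and rescale them (the on-the-fly step)."""
    out: List[float] = []
    for byte in packed:
        for nibble in (byte >> 4, byte & 0xF):
            signed = nibble - 16 if nibble >= 8 else nibble  # two's complement
            out.append(signed * scale)
    return out


weights = [0.7, -0.2, 0.11, 0.0, -0.7, 0.42]  # made-up values
packed, scale = quantize_group(weights)
restored = dequantize_group(packed, scale)
bf16_bytes = 2 * len(weights)  # bf16 stores 2 bytes per weight
print(len(packed), bf16_bytes)
```

The per-group scale adds a small overhead on top of the packed codes, which is one reason a real format lands at "roughly" rather than exactly 4x.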

Optimizer state offloading reclaims more headroom. Offloading optimizer state from GPU to CPU memory frees significant GPU memory. On a Qwen3-30B MoE model (128 experts, 8 H200 GPUs), this reduces peak GPU memory by over 40% with no loss in throughput. Training results are bit-identical to the non-offloaded baseline.

Multi-session LoRA lets multiple clients independently load and hot-swap different LoRA adapters on the same shared frozen base model at runtime. Base-only handles are available for efficient reference model logprob computation in RL workflows. We validate zero state leakage across rapid adapter switches with verified cross-GPU parity.
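The idea of several adapters sharing one frozen base can be sketched in a few lines of plain Python. This is a toy single-layer model, not the SDK's implementation: `SharedBaseLayer`, `load_adapter`, and the session names are hypothetical, and the update follows the usual LoRA form y = Wx + scaling * B(Ax).

```python
from typing import Dict, List, Optional, Tuple

Matrix = List[List[float]]


def matvec(m: Matrix, v: List[float]) -> List[float]:
    return [sum(a * b for a, b in zip(row, v)) for row in m]


class SharedBaseLayer:
    """One frozen weight matrix shared by independently loaded LoRA adapters."""

    def __init__(self, weight: Matrix) -> None:
        self.weight = weight  # frozen: never modified after construction
        self.adapters: Dict[str, Tuple[Matrix, Matrix, float]] = {}

    def load_adapter(self, session: str, a: Matrix, b: Matrix, scaling: float) -> None:
        # A has shape (rank, in_features); B has shape (out_features, rank).
        self.adapters[session] = (a, b, scaling)

    def forward(self, x: List[float], session: Optional[str] = None) -> List[float]:
        base = matvec(self.weight, x)
        if session is None:  # base-only handle, e.g. for reference logprobs
            return base
        a, b, scaling = self.adapters[session]
        delta = matvec(b, matvec(a, x))
        return [y + scaling * d for y, d in zip(base, delta)]


layer = SharedBaseLayer([[1.0, 0.0], [0.0, 1.0]])
layer.load_adapter("client-a", a=[[1.0, 1.0]], b=[[0.5], [0.0]], scaling=2.0)
layer.load_adapter("client-b", a=[[0.0, 1.0]], b=[[0.0], [1.0]], scaling=1.0)

x = [1.0, 2.0]
out_a = layer.forward(x, "client-a")
out_b = layer.forward(x, "client-b")
out_base = layer.forward(x)  # unaffected by either adapter
```

Because each adapter's contribution is added per call and the base weights are never written, switching sessions cannot leak state from one client into another in this toy version.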

Full-Parameter Training: Scaling Across a GPU Cluster

Full-parameter training updates every weight in the model — which means every weight also needs a gradient and a full set of optimizer states. For MoE architectures, expert counts that don't divide evenly into GPU counts create load-balancing challenges, and expert dispatch adds an all-to-all communication at every MoE layer. Dense models avoid the routing complexity but still hit memory walls at large scale. Making full-parameter training work across the range of architectures in our catalog — from 8B dense models on a single node to 1T MoE models on multi-node clusters — required solving problems across compute, memory, communication, and scheduling simultaneously.
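The uneven-division problem is easy to see with a small example. The sketch below assigns contiguous expert IDs to ranks so that per-rank counts differ by at most one; it is a generic balanced partition, not necessarily how the Fireworks engine places experts, and `partition_experts` is a hypothetical helper.

```python
from typing import List


def partition_experts(num_experts: int, num_ranks: int) -> List[range]:
    """Assign contiguous expert IDs to ranks so counts differ by at most one."""
    if num_ranks <= 0 or num_experts < num_ranks:
        raise ValueError("need at least one expert per rank")
    base, extra = divmod(num_experts, num_ranks)
    shards: List[range] = []
    start = 0
    for rank in range(num_ranks):
        size = base + (1 if rank < extra else 0)  # first `extra` ranks take one more
        shards.append(range(start, start + size))
        start += size
    return shards


# 128 experts do not divide evenly across 12 ranks (illustrative numbers).
shards = partition_experts(num_experts=128, num_ranks=12)
print([len(s) for s in shards])
```

Even with the best static split, some ranks hold one more expert than others, so their expert compute and all-to-all traffic are slightly heavier.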

Composable Parallelism

No single parallelism strategy covers every model shape. Our engine composes four dimensions from a single configuration — FSDP, Pipeline Parallelism, Context Parallelism, and Expert Parallelism — each addressing a different bottleneck. The Training SDK selects the right combination for each model: a dense 8B model may need only FSDP, while a 1T MoE at 256K context uses all four.

Figure: Composable Parallelism. Four parallelism dimensions compose from a single configuration. FSDP shards parameters, pipeline parallelism splits layers, context parallelism shards the sequence, and expert parallelism routes MoE tokens.

We break down the workload across these dimensions:

Context Parallelism serves as the primary long-context scaling axis: it shards the sequence across nodes while preserving full hidden dimensions in projections, keeping matrix multiplications efficient.

Expert Parallelism uses DeepEP for high-throughput MoE token dispatch with minimal overhead.

For architectures that mix different attention mechanisms (e.g., full attention and linear attention), we support hybrid context parallelism that handles heterogeneous layer types within a single model, validated at 35B MoE scale with KL divergence below 0.008.
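As a rough picture of how four degrees might be described and sanity-checked together, consider the sketch below. The `ParallelLayout` class, its field names, and the composition rule (FSDP, pipeline, and context degrees tile the cluster, with expert-parallel groups carved out of the FSDP x context ranks) are assumptions borrowed from common open-source practice; the post does not publish the SDK's configuration schema or the degrees used for any shape.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelLayout:
    """Hypothetical four-dimensional layout (not the SDK's schema)."""

    fsdp: int      # parameter-sharding degree
    pipeline: int  # number of pipeline stages
    context: int   # sequence-sharding degree
    expert: int    # expert-parallel degree

    def validate(self, world_size: int) -> None:
        # Assumed rule: fsdp * pipeline * context covers every GPU exactly once.
        if self.fsdp * self.pipeline * self.context != world_size:
            raise ValueError("fsdp * pipeline * context must equal world_size")
        # Assumed rule: expert groups reuse the fsdp x context ranks.
        if (self.fsdp * self.context) % self.expert != 0:
            raise ValueError("expert degree must divide fsdp * context")

    def tokens_per_rank(self, sequence_length: int) -> int:
        if sequence_length % self.context != 0:
            raise ValueError("sequence must divide evenly across context ranks")
        return sequence_length // self.context

    def layers_per_stage(self, num_layers: int) -> int:
        # Ceiling division: the last stage may hold fewer layers.
        return -(-num_layers // self.pipeline)


# Illustrative degrees only: a small dense model versus a large MoE.
dense = ParallelLayout(fsdp=4, pipeline=1, context=1, expert=1)
dense.validate(world_size=4)

large_moe = ParallelLayout(fsdp=2, pipeline=4, context=8, expert=8)
large_moe.validate(world_size=64)
tokens = large_moe.tokens_per_rank(256 * 1024)
layers = large_moe.layers_per_stage(61)
```

The point of a single description like this is that mistakes such as degrees that do not multiply out to the GPU count fail fast, before any job is launched.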

Blackwell-Native Precision

Full-parameter training is compute- and memory-intensive, so numerical precision matters. For MoE expert computation, we use MXFP8 native grouped GEMMs that leverage Blackwell's block-scaled tensor core matrix multiplications — the hardware dequantizes during the systolic-array multiply, not in a separate kernel.
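For intuition about what "block-scaled" means, the sketch below emulates MXFP8 quantization in plain Python, assuming the OCP Microscaling layout: 32-element blocks, one shared power-of-two scale per block, and FP8 E4M3 elements. It is a numerical illustration only; the actual kernels run on tensor cores, and the post does not publish their exact scaling recipe. All function names are hypothetical.

```python
import math
from typing import List, Tuple

BLOCK_SIZE = 32    # elements sharing one scale in the MX formats
E4M3_MAX = 448.0   # largest finite FP8 E4M3 magnitude
E4M3_EMAX = 8      # exponent of the largest E4M3 binade (448 = 1.75 * 2**8)
E4M3_EMIN = -6     # exponent of the smallest normal E4M3 binade


def round_to_e4m3(x: float) -> float:
    """Round to the nearest value on the E4M3 grid (3 mantissa bits), saturating."""
    if x == 0.0:
        return 0.0
    # floor(log2|x|), clamped so tiny values fall on the subnormal grid.
    exponent = max(math.frexp(abs(x))[1] - 1, E4M3_EMIN)
    step = 2.0 ** (exponent - 3)  # spacing between neighbors in this binade
    return max(-E4M3_MAX, min(E4M3_MAX, round(x / step) * step))


def quantize_block(block: List[float]) -> Tuple[List[float], float]:
    """Return E4M3-rounded elements and the shared power-of-two scale."""
    if len(block) != BLOCK_SIZE:
        raise ValueError("MX blocks hold exactly 32 elements")
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return [0.0] * BLOCK_SIZE, 1.0
    shared_exp = (math.frexp(amax)[1] - 1) - E4M3_EMAX
    scale = 2.0 ** shared_exp
    return [round_to_e4m3(v / scale) for v in block], scale


def dequantize_block(elements: List[float], scale: float) -> List[float]:
    return [e * scale for e in elements]


block = [i / 8 for i in range(-16, 16)]  # made-up values in [-2.0, 1.875]
elements, scale = quantize_block(block)
restored = dequantize_block(elements, scale)
```

Because the scale is a power of two, applying it is an exponent adjustment rather than a full multiply, which is what makes it cheap to fold into the matrix multiplication itself.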

On DeepSeek V3-class expert shapes (32 experts per rank, 7168 hidden dimension, 2048 intermediate), this delivers a significant speedup over BF16 across both forward and backward passes. Across all tested configurations including Qwen3-235B shapes, the speedup is consistent while maintaining end-to-end numerical fidelity: symmetric KL divergence stays below 0.0063 for every configuration, well within our 0.01 acceptance threshold.
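A fidelity check of this kind can be sketched as follows. The post does not define "symmetric KL" precisely, so this sketch assumes the average of the two KL directions over next-token distributions; the probabilities are made up, and only the 0.01 threshold comes from the text.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) in nats; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def symmetric_kl(p: Sequence[float], q: Sequence[float]) -> float:
    # Assumption: "symmetric" means the mean of the two directions.
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))


ACCEPTANCE_THRESHOLD = 0.01  # threshold quoted in the text

reference = [0.70, 0.20, 0.10]  # e.g. BF16 next-token probabilities (made up)
candidate = [0.68, 0.21, 0.11]  # e.g. MXFP8 next-token probabilities (made up)
score = symmetric_kl(reference, candidate)
print(score < ACCEPTANCE_THRESHOLD)
```

In practice such a score would be averaged over many token positions and prompts rather than computed for a single distribution.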

For attention, we integrate FA4 (CuTeDSL) kernels designed for Blackwell's SM100 architecture, handling the native Multi-head Latent Attention (MLA) shapes used by DeepSeek V3 and Kimi K2.5 — QK dimension 192, V dimension 128 — in both forward and backward passes without padding or reshaping. We collaborated with the community on the Flash Attention 4 backward kernel for these specific dimensions.

We also support FP8 Quantization-Aware Training (QAT), where fake-quantization operations in training exactly mirror the inference engine's math. Models trained with QAT deploy at reduced precision with matching behavior — no post-training quantization surprises.

Optimized RL Loss Computation

Custom loss functions in the Training SDK use forward_backward_custom, which executes two model forward passes: one to extract per-token log-probabilities, and a second forward-backward pass that propagates gradients through a cross-entropy surrogate. This generality lets you implement any RL objective in Python — but it doubles the forward-pass cost.

For production RL algorithms — GRPO, DRO, DAPO, GSPO, CISPO, and standard SFT cross-entropy — we fuse the loss computation into the forward pass itself, eliminating the extra round trip.

    from fireworks.training.sdk import FiretitanServiceClient

    service = FiretitanServiceClient(base_url=endpoint, api_key=api_key)
    policy = service.create_training_client(base_model=base_model)

    # Custom path: arbitrary loss, 2 forward passes + 1 backward
    def my_loss(data, logprobs_list):
        loss = custom_objective(logprobs_list, advantages)
        return loss, {"custom_metric": loss.item()}

    result = policy.forward_backward_custom(data, my_loss)

    # Built-in fused path: 1 forward + 1 backward, loss computed on-device
    result = policy.forward_backward(data, "ppo", {"clip_low_threshold": 0.8, "clip_high_threshold": 1.2})

Figure: Fused RL Loss Computation. The standard two-pass approach recomputes the full forward pass to obtain loss. The fused path computes loss directly in a single forward-backward pass, yielding up to 2x speedup for PPO.
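For readers less familiar with the objectives being fused, a PPO-style clipped surrogate can be computed directly from per-token log-probabilities, as sketched below. Treating clip_low_threshold and clip_high_threshold as lower and upper bounds on the importance ratio is an assumption based on their names and values; the function is a reference formula in plain Python, not the on-device fused kernel.

```python
import math
from typing import Sequence


def clipped_surrogate_loss(
    new_logprobs: Sequence[float],
    old_logprobs: Sequence[float],
    advantages: Sequence[float],
    clip_low: float = 0.8,
    clip_high: float = 1.2,
) -> float:
    """Mean PPO-style clipped policy loss over tokens (lower is better)."""
    total = 0.0
    for new_lp, old_lp, adv in zip(new_logprobs, old_logprobs, advantages):
        ratio = math.exp(new_lp - old_lp)               # importance ratio
        clipped = min(max(ratio, clip_low), clip_high)  # keep ratio in [low, high]
        total += -min(ratio * adv, clipped * adv)       # pessimistic bound
    return total / len(new_logprobs)


# Ratio of 1.5 with a positive advantage is clipped at 1.2.
loss_up = clipped_surrogate_loss([math.log(1.5)], [0.0], [1.0])
# Ratio of 0.5 with a negative advantage is clipped at 0.8.
loss_down = clipped_surrogate_loss([math.log(0.5)], [0.0], [-1.0])
```

A loss of this shape needs only the per-token log-probabilities and advantages, which is why it can be evaluated inside the same forward pass that produces the log-probabilities.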

On a Qwen3.5-35B MoE model running on 8 H200 GPUs:

| Algorithm | Speedup |
|---|---|
| GRPO | ~2x |
| DRO | ~1.7x |
| DAPO | ~1.4x |
| SFT | ~1.3x |

All fused losses are numerically identical to the two-pass reference at step one and remain within the noise floor of MoE routing non-determinism at subsequent steps.

Streaming Pipeline Parallelism

Our Training API uses an HTTP-based interface where data items are sent to the trainer online. Standard pipeline parallelism implementations assume batch-oriented data loading — a mismatch with RL workloads where rollout data arrives asynchronously. We redesigned our pipeline schedule to begin execution as data arrives, eliminating the batch-accumulation bottleneck.

Figure: Streaming Pipeline Parallelism. Accumulated scheduling waits for a full batch before executing. Streaming execution begins immediately as data arrives, reducing first-result latency by up to an order of magnitude.

The result is up to an order-of-magnitude improvement in first-result latency for RL workloads, depending on model size and traffic pattern — the benefit is largest when input QPS is low relative to batch size, which is common in RL rollout settings. Loss parity is exact: the streaming schedule produces the same gradients as the accumulated batch.
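A toy latency model shows why the gain depends on arrival rate relative to batch size. It assumes items arrive at a steady rate and that one microbatch takes a fixed time per pipeline stage; the function name and all numbers are illustrative, not measurements from the post.

```python
def first_result_latency(
    batch_size: int,
    arrival_qps: float,
    num_stages: int,
    stage_seconds: float,
    streaming: bool,
) -> float:
    """Seconds from the first item's arrival until the first microbatch finishes."""
    # Items arrive at a steady rate; item i lands at i / arrival_qps seconds.
    # Accumulated scheduling waits for the last item; streaming does not wait.
    wait = 0.0 if streaming else (batch_size - 1) / arrival_qps
    # One microbatch then traverses every pipeline stage once.
    return wait + num_stages * stage_seconds


accumulated = first_result_latency(32, 2.0, 4, 0.25, streaming=False)
streamed = first_result_latency(32, 2.0, 4, 0.25, streaming=True)
print(accumulated / streamed)
```

In this model the accumulated schedule is dominated by waiting for the batch to fill, so lowering the arrival rate or raising the batch size widens the gap, while a fast arrival rate shrinks it.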

Architecture Coverage

Each model family required deep distributed-training engineering to bring up — and the numerical parity pitfalls we cataloged for MoE serving apply equally to training. Qwen3.5-35B alone required solving 9 distinct gradient correctness bugs across shared experts, router gates, GQA, and DeltaNet layers. Every model ships with SFT memorization validation confirming end-to-end numerical correctness.

What's Next

The training shapes available today extend to 256K tokens of context. We are actively pushing that frontier.

Ultra-Long Context Training

We have validated training of trillion-parameter MoE models at over one million tokens of context on GB200 GPU clusters. To our knowledge, no other published system has demonstrated MoE training at this combination of model scale and context length. The closest comparisons:

| System | Total Params | Max Train Context | Architecture |
|---|---|---|---|
| DeepSeek V3 | 671B | 128K | MoE |
| Llama 3.1 | 405B | 128K | Dense |
| Qwen3-235B | 235B | 262K | MoE |
| Nemotron 3 Super | 120B | 1M | Hybrid Mamba-MoE |

While Nemotron 3 Super reaches 1M context, it does so at 120B total parameters — 8.5x smaller and built on a fundamentally different Mamba-Transformer hybrid architecture. DeepSeek V3 is

[truncated for AI cost control]