Fireworks AI Blog AI News Source

Public articles 22Collected articles 24Trust 84Refresh 120 min

Health HealthySource type OfficialFull-text rights Official full textLast ingested 2026-06-26ID fireworks-blogStatus Enabled

Official AI inference and model platform blog; confirm reuse terms before full body display.

Latest public articles

Cursor Composer 2 + Fireworks AI

2026-06-26 22:15 UTC

Cursor released Composer 2, a coding model optimized for the Cursor development environment. Based on Kimi 2.5, it combines continual pre-training and large-scale reinforcement learning to achieve frontier-level coding performance while reducing inference cost by 6-10x. Fireworks AI provides the distributed inference infrastructure to make RL scalable.

Composer 2 is a specialized coding model for Cursor's environment, improved through continual pre-training and reinforcement learning.
It achieves top scores on CursorBench, Terminal-Bench, and SWE-bench Multilingual.

Frontier AI at a Fraction of the Cost: Open-Source Worker Agents with a Closed-Source Advisor

2026-06-26 18:14 UTC

This article presents a worker-advisor architecture that combines open-source worker agents with a closed-source advisor model, achieving near-frontier performance on multiple benchmarks at significantly lower costs. The GLM-5.2 + Opus 4.8 combination shows consistent improvements across SWE-bench Pro, Terminal-Bench 2.1, and Legal Agent Bench, with cost savings of 19% to 67% compared to using Opus alone as the worker.

An open-source worker (Kimi-K2.6 or GLM-5.2) drives the task end-to-end, consulting a closed-source frontier model (Claude Opus 4.8) once for review.
Lifts of +4 to +7 pp on SWE-bench Pro, +4 to +8 pp on Terminal-Bench 2.1, and +1 to +4 pp on Legal Agent Bench.

Fireworks AI

2026-06-19 21:55 UTC

Fireworks AI is migrating all self-serve accounts to prepaid billing effective July 1, 2026. Users can switch now to control the timing or be automatically migrated. Prepaid billing offers predictability with credit purchases and auto-reload options. Contracted customers are not affected.

Fireworks AI moves to prepaid billing for self-serve accounts on July 1, 2026.
Users can either switch now or be migrated automatically on that date.

GLM 5.2 is live on Fireworks inference, day zero.

2026-06-18 21:52 UTC

GLM 5.2, the latest open-source model from Z.ai (formerly Zhipu), is now available on Fireworks inference platform. It leads coding benchmarks, features a 1M-token context window for long-horizon tasks, and is MIT-licensed. Fireworks validates performance independently, emphasizing infrastructure over routing.

GLM 5.2 is now live on Fireworks inference, day zero.
It is the strongest open-source model for coding, with a 1M-token context.

Kimi K2.7 Code on Fireworks: Better Agents, Lower Cost per Task, Available Day-0 | Fireworks AI

2026-06-16 21:46 UTC

Moonshot AI has released Kimi K2.7 Code, the latest coding model in the K2 line, now available on Fireworks AI with Day-0 support. The model uses 30% fewer reasoning tokens than its predecessor while achieving higher scores on coding benchmarks. This reduction in reasoning tokens significantly lowers the cost per completed task for agentic workflows. Fireworks offers three serving tiers: Standard, Priority, and Fast (coming soon), catering to different reliability and speed needs.

Kimi K2.7 Code uses 30% fewer reasoning tokens than K2.6 but scores higher on coding evals.
Lower reasoning tokens reduce overall cost per task in agentic workflows due to compounding effects.

Qwen 3.7 Plus is now live on Fireworks

2026-06-13 05:37 UTC

Alibaba has partnered with Fireworks to host Qwen 3.7 Plus on its infrastructure, making the flagship multimodal model available via serverless API. Designed for agent loops, it supports thinking and non-thinking modes, a 262K context window, and offers a 50% price reduction over its predecessor. Fireworks provides direct inference with zero data retention and 99.9% uptime SLA.

Qwen 3.7 Plus is now available exclusively on Fireworks as a serverless API.
The model is built for agentic workflows, supporting image input and reasoning preservation across turns.

MiniMax M3 is live: long context + native multimodality at 1/20th the price

2026-06-13 05:36 UTC

MiniMax releases flagship model M3 with over 500K token context window, native multimodality, and MiniMax Sparse Attention (MSA) architecture, delivering frontier-level coding and agentic capabilities at a fraction of the cost of previous models.

MiniMax M3 supports over 500K token context, expanding to 1M soon.
Uses MiniMax Sparse Attention (MSA) for sub-quadratic scaling, 4x faster than alternatives.

NVIDIA Nemotron 3 Ultra is live on Fireworks, day zero

2026-06-12 17:37 UTC

NVIDIA's Nemotron 3 Ultra, an open model optimized for long-running autonomous agents, launches with day-zero support on Fireworks. With 550B total parameters, hybrid Transformer-Mamba MoE architecture, and up to 1M context, it offers 5x faster inference and 30% lower cost for agentic tasks compared to other open models. Fireworks provides dedicated GPU deployments and a unified platform for training and inference.

Nemotron 3 Ultra is an open model designed for long-running autonomous agents, featuring 550B total and 55B active parameters.
It uses a hybrid Transformer-Mamba MoE architecture with up to 1M context length.

Open-source agents with frontier advisors: matching frontier performance through training and harness engineering

2026-06-04 17:51 UTC

Fireworks AI and Harvey explore two system-level techniques on Legal Agent Benchmark (LAB) to reduce reliance on single frontier model calls while achieving frontier-level performance at lower cost. A hybrid harness with open-source GLM 5.1 worker and Claude Opus 4.7 advisor achieves 18/100 all-pass at $368, surpassing Opus alone (14/100 at $954). Post-training of Kimi K2.6 via SFT and RFT yields 15/100 all-pass at $84 and improved mean scores respectively.

Hybrid harness with open-source worker and frontier advisor as callable tool achieves higher all-pass at lower cost than end-to-end frontier model.
Post-training on Fireworks: SFT lifts all-pass from 11 to 15/100; RFT boosts mean score from 0.863 to 0.886.

Trilogy Validates Open-Weight AI Models for Enterprise Workloads with Fireworks AI

2026-06-03 17:47 UTC

Trilogy's AI Center of Excellence evaluated Fireworks AI as inference infrastructure to standardize open-weight model usage, reducing costs and enabling billion-token-scale agentic workflows.

Trilogy adopted Fireworks AI as the inference layer for enterprise open-weight models.
Reduced cost to ~1/5 of proprietary systems and eliminated rate limit issues.

Agent Execution Tax

2026-06-01 17:42 UTC

A benchmark of 720 browser agent tasks reveals that structured output reliability, not raw intelligence, is the bottleneck in agentic AI. Gemini 2.5 Flash incurred a 22.9% execution tax due to malformed JSON, while Kimi K2.5 had zero. This tax compounds into higher latency, cost, and failure rates. The report introduces Reliability-Adjusted Accuracy and cost-per-successful-task metrics.

Agent Execution Tax measures wasted inference from structured output failures; top model had 22.9% tax.
Gemini 2.5 Flash had 86.7% probability of at least one parse retry per task; Kimi K2.5 had 0%.

Serverless 2.0: Three Ways to Run Inference, One API

2026-05-29 01:34 UTC

Fireworks AI launches Serverless 2.0, offering Standard, Priority, and Fast inference paths through a single API without reserved capacity. The Priority path provides stronger request admission under congestion, while the Fast path delivers roughly 2x throughput. The update also clarifies error codes by separating load shedding (503) from rate limits (429), improving retry logic and alerting.

Serverless 2.0 introduces three serving intents: Standard (default), Priority (stronger admission under load), and Fast (higher token throughput).
Priority achieved 0% 503 error rate in peak-load testing versus 0.082% for Standard.

Innovative Solutions Rebuilds Enterprise Services Delivery with Fireworks AI

2026-05-21 00:15 UTC

As a Tier 1 AWS Premier Partner, Innovative Solutions transformed its services delivery by migrating its inference layer to Fireworks AI. The DarcyIQ platform evolved from an internal productivity tool into a multi-agent execution system, compressing contract cycles from 30–45 days to ~3 days, doubling delivery throughput, and making inference costs predictable and controllable.

Innovative Solutions migrated its inference layer from Anthropic to Fireworks AI, reducing model integration overhead and achieving stable, cost-predictable inference.
DarcyIQ evolved into a multi-agent execution system covering sales, scoping, and delivery, cutting contract cycles to ~3 days.

Fireworks AI Acquires Hathora to Accelerate Global Compute Orchestration

2026-05-15 02:31 UTC

Fireworks AI has acquired Hathora, a company specializing in low-latency container orchestration for gaming, to enhance its AI inference platform with millisecond-level routing and global scalability.

Fireworks AI acquires Hathora to improve AI inference latency.
Hathora's orchestration handles routing across 14 regions and multiple clouds.

Fireworks AI

2026-05-15 02:30 UTC

Fireworks AI announces public preview of its high-performance open model inference on Microsoft Foundry, integrating leading models like DeepSeek V3.2 and Kimi K2.5 into Azure with BYOW and flexible pricing.

Fireworks AI on Microsoft Foundry brings best-in-class open model inference to Azure. Available models include DeepSeek V3.2, Kimi K2.5, and more. Supports bring-your-own-weights and serverless or provisioned throughput pricing.

The Fine-Tuning Bottleneck Isn't the Algorithm

2026-05-15 02:29 UTC

Teams often think the training algorithm is the bottleneck in fine-tuning, but the real challenges are integration friction and slow iteration cycles. This article explores these bottlenecks through real-world examples from Genspark, Cursor, and more, and looks ahead to automated, agentic fine-tuning loops.

Integration and data sovereignty issues, not algorithms, are the main bottlenecks in fine-tuning.
Fast iteration cycles (from weeks to hours) are crucial for successful fine-tuning.

Own Your AI: Fireworks Training Preview

2026-05-15 02:28 UTC

Fireworks AI launches Training Preview, an end-to-end platform for training and deploying frontier models at scale. It supports full-parameter training from Qwen3 8B to Kimi K2.5 (1T parameters), offers three interfaces (Training Agent, Managed Training, Training API), and demonstrates significant performance gains in RL, SFT, DPO, and classification tasks. The platform ensures numerical parity between training and inference, enabling teams to own truly customized models.

Full-parameter training at frontier scale, from 8B to 1T parameters, on the same platform as LoRA.
Three surfaces: Training Agent (no-code), Managed Training (for ML engineers), and Training API (full control).

How we fixed prompt injection for all models on Fireworks

2026-05-15 02:28 UTC

Fireworks introduces safe_tokenization, a per-request boolean flag that prevents prompt injection by ensuring user content cannot be encoded as control tokens. The mechanism works by scanning the tokenizer vocabulary at model load and encoding user text segment-by-segment to break control-token strings into subwords. It is computationally cheap, transparent for benign inputs, and available across all supported open models.

Prompt injection arises because user text and control tokens share the same byte stream in standard tokenization pipelines, allowing user bytes to become structural tokens. Fireworks' safe_tokenization prevents this by splitting control-token strings into subwords during user content encoding.
The fix combines two steps: pre-processing the chat template at model load to separate control tokens from user interpolation, and at request time, encoding user text segments to avoid control token IDs.

DeepSeek V4 Pro: Validating Frontier Models for Production

2026-05-15 02:27 UTC

DeepSeek V4 Pro is now live on Fireworks after a delayed launch due to a reasoning trace corruption bug. The article details the issue, its debugging, and validation process.

DeepSeek V4 Pro launch delayed due to reasoning trace corruption bug across early deployments.
Fireworks coordinated with SGLang, vLLM, and DeepSeek to fix the serving-path issue.

Training-Inference Parity in MoE Models: Where Numerics Drift

2026-05-15 02:26 UTC

This article delves into numerical inconsistencies in Mixture-of-Experts (MoE) models between training and inference, caused by the non-associativity of floating-point addition. Through case studies from Kimi K2.5 and Qwen3.5-MoE, it reveals how all-reduce topology differences, fusion of communication with computation, and multi-operation fusions in MoE lead to numerical drift, and proposes solutions and measurement methods.

Non-associativity of floating-point addition is the root cause of numerical drift.
MoE models are more sensitive to tiny hidden-state changes due to routing, amplifying drift.

Notes on DeepSeek-V4's training system

2026-05-15 02:24 UTC

DeepSeek-V4's training system integrates architecture, routing, reward modeling, reasoning modes, distillation, and agent execution into a programmable loop. Key innovations include hybrid attention (CSA/HCA), anticipatory routing for stability, three reasoning modes from the same weights, generative reward models, on-policy distillation with full-vocabulary logits, and agentic training that pulls runtime into the loop. The trend points to fixed recipes giving way to programmable training infrastructure.

DeepSeek-V4 alternates Compressed Sparse Attention and Heavily Compressed Attention for long-context memory hierarchy.
Anticipatory Routing uses older router weights to prefetch routing decisions, preventing loss spikes.

Scaling and Optimizing Frontier Model Training

2026-05-15 02:24 UTC

Fireworks' blog post details how its Training SDK and optimizations (low-precision quantization, optimizer offloading, composable parallelism, Blackwell-native precision, and streaming pipeline parallelism) scale trillion-parameter MoE model training, supporting both LoRA and full-parameter modes across a wide model catalog.

Fireworks' Training SDK supports LoRA and full-parameter training for diverse MoE and dense models.
LoRA training fits trillion-parameter models on a single node via expert quantization and optimizer offloading.

Fireworks AI Blog

Latest public articles

Cursor Composer 2 + Fireworks AI

Frontier AI at a Fraction of the Cost: Open-Source Worker Agents with a Closed-Source Advisor

Fireworks AI

GLM 5.2 is live on Fireworks inference, day zero.

Kimi K2.7 Code on Fireworks: Better Agents, Lower Cost per Task, Available Day-0 | Fireworks AI

Qwen 3.7 Plus is now live on Fireworks

MiniMax M3 is live: long context + native multimodality at 1/20th the price

NVIDIA Nemotron 3 Ultra is live on Fireworks, day zero

Open-source agents with frontier advisors: matching frontier performance through training and harness engineering

Trilogy Validates Open-Weight AI Models for Enterprise Workloads with Fireworks AI

Agent Execution Tax

Serverless 2.0: Three Ways to Run Inference, One API

Innovative Solutions Rebuilds Enterprise Services Delivery with Fireworks AI

Fireworks AI Acquires Hathora to Accelerate Global Compute Orchestration

Fireworks AI

The Fine-Tuning Bottleneck Isn't the Algorithm

Own Your AI: Fireworks Training Preview

How we fixed prompt injection for all models on Fireworks

DeepSeek V4 Pro: Validating Frontier Models for Production

Training-Inference Parity in MoE Models: Where Numerics Drift

Notes on DeepSeek-V4's training system

Scaling and Optimizing Frontier Model Training

All sources