Cursor released Composer 2, a coding model optimized for the Cursor development environment. Based on Kimi 2.5, it combines continual pre-training and large-scale reinforcement learning to achieve frontier-level coding performance while reducing inference cost by 6-10x. Fireworks AI provides the distributed inference infrastructure to make RL scalable.
Composer 2 is a specialized coding model for Cursor's environment, improved through continual pre-training and reinforcement learning.
It achieves top scores on CursorBench, Terminal-Bench, and SWE-bench Multilingual.
This article presents a worker-advisor architecture that combines open-source worker agents with a closed-source advisor model, achieving near-frontier performance on multiple benchmarks at significantly lower costs. The GLM-5.2 + Opus 4.8 combination shows consistent improvements across SWE-bench Pro, Terminal-Bench 2.1, and Legal Agent Bench, with cost savings of 19% to 67% compared to using Opus alone as the worker.
An open-source worker (Kimi-K2.6 or GLM-5.2) drives the task end-to-end, consulting a closed-source frontier model (Claude Opus 4.8) once for review.
Lifts of +4 to +7 pp on SWE-bench Pro, +4 to +8 pp on Terminal-Bench 2.1, and +1 to +4 pp on Legal Agent Bench.
Fireworks AI is migrating all self-serve accounts to prepaid billing effective July 1, 2026. Users can switch now to control the timing or be automatically migrated. Prepaid billing offers predictability with credit purchases and auto-reload options. Contracted customers are not affected.
Fireworks AI moves to prepaid billing for self-serve accounts on July 1, 2026.
Users can either switch now or be migrated automatically on that date.
GLM 5.2, the latest open-source model from Z.ai (formerly Zhipu), is now available on Fireworks inference platform. It leads coding benchmarks, features a 1M-token context window for long-horizon tasks, and is MIT-licensed. Fireworks validates performance independently, emphasizing infrastructure over routing.
GLM 5.2 is now live on Fireworks inference, day zero.
It is the strongest open-source model for coding, with a 1M-token context.
Moonshot AI has released Kimi K2.7 Code, the latest coding model in the K2 line, now available on Fireworks AI with Day-0 support. The model uses 30% fewer reasoning tokens than its predecessor while achieving higher scores on coding benchmarks. This reduction in reasoning tokens significantly lowers the cost per completed task for agentic workflows. Fireworks offers three serving tiers: Standard, Priority, and Fast (coming soon), catering to different reliability and speed needs.
Kimi K2.7 Code uses 30% fewer reasoning tokens than K2.6 but scores higher on coding evals.
Lower reasoning tokens reduce overall cost per task in agentic workflows due to compounding effects.
Alibaba has partnered with Fireworks to host Qwen 3.7 Plus on its infrastructure, making the flagship multimodal model available via serverless API. Designed for agent loops, it supports thinking and non-thinking modes, a 262K context window, and offers a 50% price reduction over its predecessor. Fireworks provides direct inference with zero data retention and 99.9% uptime SLA.
Qwen 3.7 Plus is now available exclusively on Fireworks as a serverless API.
The model is built for agentic workflows, supporting image input and reasoning preservation across turns.
MiniMax releases flagship model M3 with over 500K token context window, native multimodality, and MiniMax Sparse Attention (MSA) architecture, delivering frontier-level coding and agentic capabilities at a fraction of the cost of previous models.
MiniMax M3 supports over 500K token context, expanding to 1M soon.
Uses MiniMax Sparse Attention (MSA) for sub-quadratic scaling, 4x faster than alternatives.
NVIDIA's Nemotron 3 Ultra, an open model optimized for long-running autonomous agents, launches with day-zero support on Fireworks. With 550B total parameters, hybrid Transformer-Mamba MoE architecture, and up to 1M context, it offers 5x faster inference and 30% lower cost for agentic tasks compared to other open models. Fireworks provides dedicated GPU deployments and a unified platform for training and inference.
Nemotron 3 Ultra is an open model designed for long-running autonomous agents, featuring 550B total and 55B active parameters.
It uses a hybrid Transformer-Mamba MoE architecture with up to 1M context length.
Fireworks AI and Harvey explore two system-level techniques on Legal Agent Benchmark (LAB) to reduce reliance on single frontier model calls while achieving frontier-level performance at lower cost. A hybrid harness with open-source GLM 5.1 worker and Claude Opus 4.7 advisor achieves 18/100 all-pass at $368, surpassing Opus alone (14/100 at $954). Post-training of Kimi K2.6 via SFT and RFT yields 15/100 all-pass at $84 and improved mean scores respectively.
Hybrid harness with open-source worker and frontier advisor as callable tool achieves higher all-pass at lower cost than end-to-end frontier model.
Post-training on Fireworks: SFT lifts all-pass from 11 to 15/100; RFT boosts mean score from 0.863 to 0.886.
Trilogy's AI Center of Excellence evaluated Fireworks AI as inference infrastructure to standardize open-weight model usage, reducing costs and enabling billion-token-scale agentic workflows.
Trilogy adopted Fireworks AI as the inference layer for enterprise open-weight models.
Reduced cost to ~1/5 of proprietary systems and eliminated rate limit issues.
A benchmark of 720 browser agent tasks reveals that structured output reliability, not raw intelligence, is the bottleneck in agentic AI. Gemini 2.5 Flash incurred a 22.9% execution tax due to malformed JSON, while Kimi K2.5 had zero. This tax compounds into higher latency, cost, and failure rates. The report introduces Reliability-Adjusted Accuracy and cost-per-successful-task metrics.
Agent Execution Tax measures wasted inference from structured output failures; top model had 22.9% tax.
Gemini 2.5 Flash had 86.7% probability of at least one parse retry per task; Kimi K2.5 had 0%.
Fireworks AI launches Serverless 2.0, offering Standard, Priority, and Fast inference paths through a single API without reserved capacity. The Priority path provides stronger request admission under congestion, while the Fast path delivers roughly 2x throughput. The update also clarifies error codes by separating load shedding (503) from rate limits (429), improving retry logic and alerting.
Serverless 2.0 introduces three serving intents: Standard (default), Priority (stronger admission under load), and Fast (higher token throughput).
Priority achieved 0% 503 error rate in peak-load testing versus 0.082% for Standard.
As a Tier 1 AWS Premier Partner, Innovative Solutions transformed its services delivery by migrating its inference layer to Fireworks AI. The DarcyIQ platform evolved from an internal productivity tool into a multi-agent execution system, compressing contract cycles from 30–45 days to ~3 days, doubling delivery throughput, and making inference costs predictable and controllable.
Innovative Solutions migrated its inference layer from Anthropic to Fireworks AI, reducing model integration overhead and achieving stable, cost-predictable inference.
DarcyIQ evolved into a multi-agent execution system covering sales, scoping, and delivery, cutting contract cycles to ~3 days.
Fireworks AI has acquired Hathora, a company specializing in low-latency container orchestration for gaming, to enhance its AI inference platform with millisecond-level routing and global scalability.
Fireworks AI acquires Hathora to improve AI inference latency.
Hathora's orchestration handles routing across 14 regions and multiple clouds.
Fireworks AI announces public preview of its high-performance open model inference on Microsoft Foundry, integrating leading models like DeepSeek V3.2 and Kimi K2.5 into Azure with BYOW and flexible pricing.
Fireworks AI on Microsoft Foundry brings best-in-class open model inference to Azure. Available models include DeepSeek V3.2, Kimi K2.5, and more. Supports bring-your-own-weights and serverless or provisioned throughput pricing.
Teams often think the training algorithm is the bottleneck in fine-tuning, but the real challenges are integration friction and slow iteration cycles. This article explores these bottlenecks through real-world examples from Genspark, Cursor, and more, and looks ahead to automated, agentic fine-tuning loops.
Integration and data sovereignty issues, not algorithms, are the main bottlenecks in fine-tuning.
Fast iteration cycles (from weeks to hours) are crucial for successful fine-tuning.
Fireworks AI launches Training Preview, an end-to-end platform for training and deploying frontier models at scale. It supports full-parameter training from Qwen3 8B to Kimi K2.5 (1T parameters), offers three interfaces (Training Agent, Managed Training, Training API), and demonstrates significant performance gains in RL, SFT, DPO, and classification tasks. The platform ensures numerical parity between training and inference, enabling teams to own truly customized models.
Full-parameter training at frontier scale, from 8B to 1T parameters, on the same platform as LoRA.
Three surfaces: Training Agent (no-code), Managed Training (for ML engineers), and Training API (full control).
Fireworks introduces safe_tokenization, a per-request boolean flag that prevents prompt injection by ensuring user content cannot be encoded as control tokens. The mechanism works by scanning the tokenizer vocabulary at model load and encoding user text segment-by-segment to break control-token strings into subwords. It is computationally cheap, transparent for benign inputs, and available across all supported open models.
Prompt injection arises because user text and control tokens share the same byte stream in standard tokenization pipelines, allowing user bytes to become structural tokens. Fireworks' safe_tokenization prevents this by splitting control-token strings into subwords during user content encoding.
The fix combines two steps: pre-processing the chat template at model load to separate control tokens from user interpolation, and at request time, encoding user text segments to avoid control token IDs.
DeepSeek V4 Pro is now live on Fireworks after a delayed launch due to a reasoning trace corruption bug. The article details the issue, its debugging, and validation process.
DeepSeek V4 Pro launch delayed due to reasoning trace corruption bug across early deployments.
Fireworks coordinated with SGLang, vLLM, and DeepSeek to fix the serving-path issue.
This article delves into numerical inconsistencies in Mixture-of-Experts (MoE) models between training and inference, caused by the non-associativity of floating-point addition. Through case studies from Kimi K2.5 and Qwen3.5-MoE, it reveals how all-reduce topology differences, fusion of communication with computation, and multi-operation fusions in MoE lead to numerical drift, and proposes solutions and measurement methods.
Non-associativity of floating-point addition is the root cause of numerical drift.
MoE models are more sensitive to tiny hidden-state changes due to routing, amplifying drift.
DeepSeek-V4's training system integrates architecture, routing, reward modeling, reasoning modes, distillation, and agent execution into a programmable loop. Key innovations include hybrid attention (CSA/HCA), anticipatory routing for stability, three reasoning modes from the same weights, generative reward models, on-policy distillation with full-vocabulary logits, and agentic training that pulls runtime into the loop. The trend points to fixed recipes giving way to programmable training infrastructure.
DeepSeek-V4 alternates Compressed Sparse Attention and Heavily Compressed Attention for long-context memory hierarchy.
Anticipatory Routing uses older router weights to prefetch routing decisions, preventing loss spikes.
Fireworks' blog post details how its Training SDK and optimizations (low-precision quantization, optimizer offloading, composable parallelism, Blackwell-native precision, and streaming pipeline parallelism) scale trillion-parameter MoE model training, supporting both LoRA and full-parameter modes across a wide model catalog.
Fireworks' Training SDK supports LoRA and full-parameter training for diverse MoE and dense models.
LoRA training fits trillion-parameter models on a single node via expert quantization and optimizer offloading.