Baseten Blog AI News Source

Public articles: 18 | Collected articles: 27 | Trust: 82 | Refresh: 120 min

Health: Healthy | Source type: Official | Full-text rights: Official full text | Last ingested: 2026-06-25 | ID: baseten-blog | Status: Enabled

Official AI inference and deployment platform blog; confirm reuse terms before full body display.

Latest public articles

AI training vs. inference: what's the difference?

2026-06-25 22:12 UTC

AI training teaches models to learn from data. Inference is what runs in production. This article explains the key differences in hardware, cost, and optimization, covering the model lifecycle from pretraining to serving, the four metrics for inference performance, and a comparison between the two phases.

Training is the process of a model learning from large datasets by adjusting its weights, requiring significant compute resources.
Inference occurs every time a trained model generates output for a user request, with no learning involved.
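The distinction can be illustrated with a minimal sketch, assuming a toy one-parameter model (the names `predict` and `train` are illustrative and do not come from the article). Training loops over data and adjusts the weight; inference is a single forward pass that leaves the weight untouched.

```python
# Toy one-parameter model y = w * x, fitted by gradient descent.

def predict(w: float, x: float) -> float:
    """Inference: a forward pass with fixed weights; nothing is updated."""
    return w * x


def train(data: list[tuple[float, float]], lr: float = 0.05, epochs: int = 200) -> float:
    """Training: repeatedly adjust the weight to reduce squared error."""
    w = 0.0
    for _ in range(epochs):
        for x, y in data:
            error = predict(w, x) - y
            w -= lr * 2 * error * x  # gradient of (w*x - y)^2 with respect to w
    return w


data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]  # samples of y = 3x
w = train(data)              # expensive: many passes over the data
print(predict(w, 4.0))       # cheap: one forward pass per request
```

Even in this toy case, training touches every sample hundreds of times while each inference call is a single multiplication, which is the cost asymmetry the article describes.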

How to run GLM-5.2 in any harness

2026-06-25 22:12 UTC

GLM-5.2 is this year's DeepSeek moment, matching closed-source quality while being 4.5x faster and 5x cheaper. This article provides step-by-step instructions to set up GLM-5.2 in Claude Code, Codex, and Deep Agents CLI in under 5 minutes.

GLM-5.2 is a high-performance open-source model that can replace closed models like Opus 4.8.
Configure Claude Code by editing environment variables to use GLM-5.2.

NVIDIA BioNeMo Agent Toolkit on Baseten

2026-06-23 20:06 UTC

NVIDIA BioNeMo Agent Toolkit helps transform general-purpose AI agents into scientific agents capable of performing real biological and drug discovery tasks. The toolkit combines BioNeMo Skills, open models, NVIDIA NIM microservices, and agent infrastructure to enable workflows such as protein structure prediction, protein design, virtual screening, genomics analysis, and target discovery. All BioNeMo NIM microservices are available today in the Baseten Model Library, making it easy for developers to deploy and scale scientific AI applications.

NVIDIA BioNeMo Agent Toolkit transforms general AI agents into scientific agents for biology and drug discovery.
It integrates BioNeMo Skills, open models, NVIDIA NIM microservices, and agent orchestration infrastructure.

The best open-source large language models (LLMs)

2026-06-18 08:13 UTC

We compare 8 top open-source LLMs in production: DeepSeek V4 Pro, Gemma 4, GLM 5.1, GPT OSS 120B, Kimi K2.6, MiniMax M3, Nemotron 3 Ultra, and Qwen 3.6. Find the best model for agentic coding, long-context reasoning, cost, and speed.

Kimi K2.6 is the most well-rounded; Qwen 3.6 and GLM 5.1 lead for agentic coding; DeepSeek and Nemotron dominate long-context and enterprise workloads; GPT OSS 120B performs well on cost and speed.
DeepSeek V4 Pro offers a 1M-token context window with CSA and HCA reducing KV cache memory to ~2% of standard models.
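To give a sense of scale for the KV cache claim, here is a back-of-the-envelope sketch. The layer count, head count, and head dimension below are hypothetical values chosen for illustration, not DeepSeek V4 Pro's actual configuration, and the formula is the conventional per-token key/value cache size rather than anything specific to CSA or HCA.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   tokens: int, bytes_per_value: int = 2) -> int:
    """Conventional KV cache size: one key and one value vector
    per layer, per KV head, per token."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_value


# Hypothetical model dimensions, 16-bit values, 1M-token context.
baseline = kv_cache_bytes(layers=60, kv_heads=8, head_dim=128, tokens=1_000_000)
compressed = baseline * 2 // 100  # applying the ~2% figure quoted above

print(f"baseline:   {baseline / 1e9:.1f} GB")
print(f"compressed: {compressed / 1e9:.1f} GB")
```

Under these assumed dimensions, a conventional cache for a single 1M-token sequence would run to hundreds of gigabytes, which is why a reduction of this size matters for long-context serving.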

Rolling deployments for zero-downtime model updates

2026-06-12 17:37 UTC

Baseten introduces rolling deployments, enabling teams to update models incrementally without downtime or doubled GPU costs. Replicas are replaced one at a time with gradual traffic shifting, plus pause, resume, and rollback controls. Customers report 50-60% more frequent deployments, eliminating off-peak manual babysitting.

Rolling deployments replace replicas step-by-step, avoiding blue-green's doubled GPU cost and hard cutover's all-or-nothing risk.
Two modes are available: max_surge (scale up new replicas first) for latency-sensitive workloads, and max_unavailable (scale down old replicas first) for cost-sensitive ones.
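The trade-off between the two modes can be seen in a small simulation. This is a sketch of the general rolling-update idea only; the function `rolling_update` and its `step` parameter are hypothetical and are not Baseten's API.

```python
def rolling_update(replicas: int, mode: str, step: int = 1) -> list[tuple[int, int]]:
    """Return the (old, new) replica counts after each action of a rollout.

    mode="max_surge": start new replicas first, then retire old ones.
    mode="max_unavailable": retire old replicas first, then start new ones.
    """
    if mode not in ("max_surge", "max_unavailable"):
        raise ValueError(f"unknown mode: {mode}")
    old, new = replicas, 0
    states = [(old, new)]
    while old > 0:
        batch = min(step, old)
        if mode == "max_surge":
            new += batch
            states.append((old, new))  # temporarily above target capacity
            old -= batch
        else:
            old -= batch
            states.append((old, new))  # temporarily below target capacity
            new += batch
        states.append((old, new))
    return states


for mode in ("max_surge", "max_unavailable"):
    totals = [old + new for old, new in rolling_update(3, mode)]
    print(mode, "peak:", max(totals), "floor:", min(totals))
```

With three replicas and a step of one, the surge mode never drops below three serving replicas but briefly runs four, while the unavailable mode never exceeds three but briefly runs two. Neither approaches the six replicas a blue-green cutover would need.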

Mercury 2, the first reasoning diffusion LLM, is now on Baseten

2026-06-12 14:13 UTC

Inception's Mercury 2, a diffusion LLM, is now available on Baseten. It generates over 1,000 tokens per second, 5-10x faster than leading speed-optimized models, at half the cost with comparable quality. It enables real-time speed on standard NVIDIA GPUs without custom chips. Augment Code cut costs by 90% and latency by 82% using Mercury 2.

Mercury 2 is the fastest reasoning LLM, using diffusion to generate full output in parallel passes.
It runs over 1,000 tokens per second on standard NVIDIA GPUs, reducing costs and latency.

Introducing NVIDIA Nemotron 3 Ultra: The Nemotron 3.x family is here!

2026-06-04 13:50 UTC

Nemotron 3 Ultra is a hybrid Mamba-transformer model designed for long-running agents, delivering up to 5x faster inference and 30% lower cost by replacing most attention with Mamba layers. Fully open, it enables agents to complete lengthy tasks efficiently without slowdown.

Nemotron 3 Ultra uses a hybrid architecture with mostly Mamba layers to maintain constant inference speed as context grows.
It achieves up to 5x faster inference and 30% lower cost for long-running agent workflows compared to open frontier models.

MAI-Thinking-1 is coming to Baseten

2026-06-02 19:45 UTC

Baseten and Microsoft AI announce that MAI-Thinking-1, a new flagship reasoning model, will be available on Baseten. It offers a unique balance between open-source flexibility and closed-model convenience, with clean data lineage, commercial-grade quality, and customization options.

MAI-Thinking-1 is Microsoft AI's new reasoning model that bridges open-source and proprietary models.
It was trained on curated data without distillation, giving it a clean and auditable data lineage.

Nvidia Cosmos 3: Robots Finally Take Over

2026-06-01 05:41 UTC

NVIDIA's Cosmos 3 is a foundation model for physical AI, designed to help developers build robots and autonomous systems by understanding and simulating the physical world. It supports six modes and can act as a direct controller or a data factory to generate training data, addressing the data bottleneck in robotics.

Cosmos 3 is a world foundation model for physical AI, not just video generation.
It supports six modes: text2image, text2video, image2video, forward_dynamics, inverse_dynamics, policy.

Powering Inference for the Continual Learning Era

2026-05-28 01:32 UTC

Baseten and Trajectory have built a production-grade inference pipeline for continual learning, where models are continuously updated from production traces. The pipeline compresses the time from training to deployment to roughly one hour, enabling models that improve through usage.

Continual learning allows models to improve continuously from production usage rather than static releases.
Baseten and Trajectory developed a pipeline that merges LoRA adapters, validates, and deploys them with A/B routing and provenance tracking.
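The merge step can be sketched with the standard LoRA formulation, in which a low-rank update is folded into the base weight as W + (alpha / r) * B @ A. The summary does not describe how the Baseten and Trajectory pipeline implements this, so the helper names below are illustrative and the matrices are toy values.

```python
Matrix = list[list[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(w: Matrix, a: Matrix, b: Matrix, alpha: float) -> Matrix:
    """Fold a LoRA update into the base weight: W + (alpha / r) * B @ A."""
    r = len(a)  # adapter rank = number of rows in A
    scale = alpha / r
    delta = matmul(b, a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


w = [[1.0, 0.0], [0.0, 1.0]]  # base weight, shape 2x2
a = [[1.0, 2.0]]              # A, shape r x d_in with rank r = 1
b = [[0.5], [-0.5]]           # B, shape d_out x r
merged = merge_lora(w, a, b, alpha=2.0)
print(merged)
```

After merging, the served model is a single set of dense weights, so inference carries no extra adapter cost. Validation and A/B routing would then compare the merged model against the previous one.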

The beginner’s guide to open-source AI models

2026-05-27 13:31 UTC

An introductory guide to open-source AI models covering what they are, how they work, when to use them, and their advantages over closed-source models. Includes discussion of model weights, fine-tuning, cost savings, and strategic considerations.

Open-source models typically refer to open-weight models, allowing fine-tuning and self-hosting.
On average, they cost 87% less than closed-source models.

Sub-second image generation with Flux.2 and Qwen-Image

2026-05-19 00:06 UTC

Baseten optimized image generation for Flux.2 [dev] and Qwen-Image, achieving up to 2.3x and 1.6x speedups on NVIDIA Blackwell GPUs, and significant gains on Hopper GPUs, using quantization, optimized kernels, and runtime improvements.

Baseten delivers sub-second latency for Flux.2 [dev] on B200 GPUs with FP4 quantization (0.98s).
Optimizations include FP4/FP8 quantization, optimized attention kernels, and memory optimizations that eliminate CPU offload.

How to train custom EAGLE-3 heads for speculative decoding

2026-05-15 03:46 UTC

A comprehensive guide on training custom EAGLE-3 draft heads for speculative decoding to achieve 1.5-2.5x latency improvements in LLM inference without sacrificing output quality. Covers dataset preparation, hyperparameter tuning, training workflow, evaluation, and deployment.

EAGLE-3 is a speculative decoding method that uses a lightweight draft head to predict multiple future tokens, verified by the target model in a single pass.
Training requires regenerating outputs with the target model to align token distributions; dataset quality is critical.
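The draft-then-verify loop behind speculative decoding can be sketched as follows. This is generic greedy speculative decoding with toy stand-in models, not EAGLE-3's actual draft head, and `speculative_step`, `toy_target`, and `toy_draft` are illustrative names. A real system scores all drafted positions in one batched forward pass of the target model; the loop here only simulates that check.

```python
from typing import Callable

NextToken = Callable[[list[int]], int]


def speculative_step(context: list[int], draft: NextToken,
                     target: NextToken, k: int) -> list[int]:
    """Propose k draft tokens, keep the prefix the target agrees with,
    then add one token chosen by the target itself."""
    proposed: list[int] = []
    for _ in range(k):
        proposed.append(draft(context + proposed))

    accepted: list[int] = []
    for tok in proposed:
        expected = target(context + accepted)
        if tok != expected:
            accepted.append(expected)  # replace the first wrong guess and stop
            return accepted
        accepted.append(tok)
    accepted.append(target(context + accepted))  # bonus token: every guess matched
    return accepted


def toy_target(ctx: list[int]) -> int:
    """Toy target model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: list[int]) -> int:
    """Toy draft model: agrees with the target except when the answer is 5."""
    nxt = (ctx[-1] + 1) % 10
    return 0 if nxt == 5 else nxt


print(speculative_step([1], toy_draft, toy_target, k=4))
print(speculative_step([5], toy_draft, toy_target, k=3))
```

In both calls the output matches what the target would have produced token by token, which is why output quality is preserved. The speedup depends on how often the draft agrees with the target, and that is why the draft head is trained on outputs regenerated by the target model.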

Harnesses are everything. Here's how to optimize yours.

2026-05-15 03:46 UTC

Three universal patterns to optimize AI harnesses: keep .md files lean and human-written, use the R.P.I. (Research, Plan, Implement) framework for structured prompts, and employ subagents (parallel fan-out and pipelines) to maintain clean context. Emphasizes that the harness, not just the model, is where engineering judgment makes a difference, and advises committing to one harness and iterating rather than switching frequently.

Keep .md files lean and human-written; avoid LLM-generated system prompts that degrade performance and increase cost.
Use progressive disclosure for CLIs, skills, and MCP tools to reduce context overhead.

NVIDIA Nemotron 3 Nano Omni: Build multimodal agents on Baseten

2026-05-15 03:45 UTC

NVIDIA Nemotron 3 Nano Omni is an open multimodal foundation model that unifies audio, images, video, and text into a single context. Built on the Nemotron 3 Nano backbone, it powers sub-agents in agentic workflows with leading efficiency and accuracy. Baseten now supports the model with day-zero availability, high-performance inference, multi-cloud capacity management, and enterprise security.

Nemotron 3 Nano Omni is an open unified multimodal model combining audio, images, video, and text.
It features latent MoE, 3D convolutional layers, and efficient video sampling for improved efficiency.

Introducing the Baseten Frontier Gateway

2026-05-15 03:44 UTC

Baseten launches Frontier Gateway, a managed API gateway for AI labs to serve models under their own domain without building or buying a separate gateway, leveraging Baseten's inference infrastructure for performance and scalability.

Frontier Gateway is a managed multi-tenant API gateway on Baseten Dedicated Inference, supporting auth, rate limits, billing, and white-label branding.
It eliminates the need to build in-house or use third-party gateways, which suffer from latency and integration overhead.

DFlash: 3x faster LLM inference

2026-05-15 03:43 UTC

DFlash introduces block diffusion for speculative decoding, predicting multiple tokens in parallel to surpass EAGLE's ~2x speedup ceiling. Baseten's implementation achieves ~3x speedups on Qwen3-8B across benchmarks, 10-30% faster than vLLM.

DFlash predicts 8-16 tokens per forward pass using bidirectional attention, overcoming EAGLE's autoregressive bottleneck.
Baseten's DFlash implementation on Qwen3-8B delivers ~3x speedup on GSM8k, MATH-500, and Nemotron datasets.

Cost-efficient, high-performance TTS with Qwen3-TTS