ParallelKernelBench tests whether LLMs can write fast multi-GPU CUDA kernels across 87 real workloads. The best model solves under a third, but a few generated kernels beat any public implementation.
ParallelKernelBench (PKB) includes 87 multi-GPU kernel generation problems from real codebases.
The best frontier model (GPT-5.5) solves under a third of the problems in a zero-shot setting, with only 22 solutions faster than the baseline.
We generated 12 landing pages with Kimi K2.7 Code and Claude Fable 5. Kimi cost 94% less and scored within a few points on every page. Open-source models are not only cheaper but genuinely competitive on quality, and the gap is closing fast.
Kimi K2.7 Code costs about 94% less than Claude Fable 5 for generating landing pages.
The quality gap between Kimi and Fable is small, especially once a design-inspiration MCP is used.
Together AI has achieved ISO 27001:2022 certification from A-LIGN, validating its information security management system for enterprise-grade AI workloads, complementing existing SOC 2 controls.
ISO 27001:2022 certification awarded by A-LIGN Compliance and Security
Scope covers the global platform, corporate HQ, and third-party data centers
Together AI optimizes MiniMax M3 serving with KV-block-major sparse attention, paged MSA decode, optimized index scoring, and a Rust-based multimodal gateway, achieving 81–125% throughput improvements across concurrency levels.
MiniMax M3 combines coding, agentic workflows, and multimodal reasoning with a 1M-token context window.
Together AI's kernel team developed KV-block-major sparse attention and integrated MSA with paged attention.
Together AI built the fastest speech-to-text stack on Artificial Analysis by treating ASR as a full-path systems problem, not just a GPU inference problem. This article details optimizations including TensorRT multi-profile encoders, conditional CUDA graphs, shared memory, evented I/O, and gc.freeze() to eliminate tail latency.
Together AI achieved the fastest STT by optimizing the entire system path, not just GPU inference.
Key techniques: TensorRT multi-profile encoders, conditional CUDA graphs, zero-copy shared memory, and evented I/O.
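The gc.freeze() technique named above can be sketched in a few lines of Python. This is a minimal illustration and not the article's code: `warm_up` and `freeze_long_lived_state` are hypothetical names, and the dictionary stands in for whatever long-lived state a real server loads at startup. The idea is to collect startup garbage once, then move every surviving object into the permanent generation so later garbage-collection passes ignore it, which can shorten the pauses that show up as tail latency.

```python
import gc


def warm_up():
    # Hypothetical stand-in for loading models, vocabularies, and config.
    return {"vocab": [str(i) for i in range(10_000)]}


def freeze_long_lived_state():
    state = warm_up()
    gc.collect()  # clear garbage created during startup first
    gc.freeze()   # move all currently tracked objects to the permanent generation
    return state


state = freeze_long_lived_state()
frozen = gc.get_freeze_count()  # number of objects now ignored by the collector
print(f"frozen objects: {frozen}")
```

Calling gc.unfreeze() moves the objects back into the oldest generation, which is useful in tests or before a deliberate full collection.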
On a production coding agent workload, Together Inference Engine delivers 31% more TPS than the next fastest OSS engine on the same hardware, and maintains 2× better TTFT at saturation. The gains come from full-stack optimization: ThunderMLA, custom kernel rewrites, and end-to-end profiling on real traffic.
Real-world inference benchmarks for coding agents: 31% more TPS than TensorRT-LLM, 2× better TTFT at saturation, and 76% lower cost than Claude Opus 4.6.
Full-stack optimization includes ThunderMLA fused kernel, custom kernel rewrites, and end-to-end profiling.
Together AI partners with Pearl Research Labs to launch a discounted Pearl-powered inference endpoint for Gemma-4-31B-it-pearl, using Proof of Useful Work to turn AI workloads into crypto emissions.
Together AI partners with Pearl Research Labs to offer a discounted inference endpoint.
The endpoint uses Proof of Useful Work to mine cryptocurrency while performing AI inference.
Violin is an open-source AI video translation tool combining speech recognition, LLM translation, and text-to-speech to make video content accessible across languages. It offers a web app, CLI, and agent skills, featuring a video-aware chat assistant and personalized voice selection. Built with Together API using models like Whisper, DeepSeek, and Cartesia, it's released under the MIT license.
Violin integrates ASR, LLM translation, and TTS for open-source video translation.
Supports web app, CLI, and agent skills for diverse users.
Voice finder helps developers search, match, filter, and audition 600+ voices across Together AI TTS models using natural-language prompts or uploaded audio samples.
Search, filter, and audition over 600 voices across leading TTS models
Find voices via natural language descriptions or audio sample uploads
Learn how to deploy any HuggingFace model in one session using Goose and Together's Dedicated Container Inference. Skip the setup complexity — one prompt gets your model running in a production-grade GPU environment on release day.
Use Goose and Together's Dedicated Container Inference to deploy models with zero lag on release day.
The author deployed Netflix's void-model in a single session with one prompt.
DeepSeek-V4's hybrid attention design (CSA, HCA, SWA) compresses KV cache, turning million-token context from a model challenge into a serving-systems problem. Together AI's early bring-up on NVIDIA HGX B200 reveals how cache policy, prefix caching, and endpoint profiles impact long-context workloads.
DeepSeek-V4's compressed sparse attention (CSA) and heavily compressed attention (HCA) reduce KV cache size, but the inference engine must manage multiple cache layouts.
Sliding window attention (SWA) becomes a bottleneck at long context, requiring a careful storage strategy.
As AI moves from research to production, the challenge for AI-native teams shifts from building models to running them efficiently, reliably, and at scale. Inference accounts for 80–90% of the total lifetime cost of a production AI system. Together AI's research (FlashAttention-4, ATLAS) and full-stack optimization enable efficient inference, improving unit economics for customers.
Inference costs dominate AI system economics, comprising 80–90% of the total lifetime cost.
Together AI introduces FlashAttention-4 (up to 1.3× faster than cuDNN) and ATLAS (adaptive speculative decoding for 4× faster inference).
Together AI and Adaption partner to bring Together Fine-Tuning natively into Adaptive Data, helping teams optimize datasets, run fine-tuning, evaluate results, and deploy stronger open models.
Together AI partners with Adaption to integrate fine-tuning into Adaptive Data.
The partnership aims to simplify the workflow from data optimization to model deployment.
Together AI details its rapid response to the Linux kernel vulnerability Copy Fail (CVE-2026-31431), which gave local unprivileged users a precise 4-byte write primitive via the AF_ALG interface, leading to privilege escalation. The team mitigated it by unloading the vulnerable kernel module, rolling out patches, and enhancing detection to keep its AI infrastructure secure.
Copy Fail (CVE-2026-31431) is a logic bug in the Linux kernel's crypto subsystem allowing precise 4-byte writes to any readable file's page cache.
Together AI unloaded the algif_aead module and removed its file within hours, blocking exploits without a reboot.
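A fleet-wide check that the module is no longer loaded could look like the sketch below. It is an illustration under stated assumptions, not Together AI's detection tooling: `loaded_modules` and `is_module_loaded` are hypothetical helper names, and the sample text only mimics the whitespace-separated layout of /proc/modules, where the module name is the first field on each line.

```python
from pathlib import Path


def loaded_modules(modules_text):
    """Return the set of module names listed in /proc/modules-style text."""
    return {line.split()[0] for line in modules_text.splitlines() if line.strip()}


def is_module_loaded(name, modules_text=None):
    """Check whether a kernel module is loaded; reads /proc/modules by default."""
    if modules_text is None:
        # /proc/modules exists on Linux only.
        modules_text = Path("/proc/modules").read_text()
    return name in loaded_modules(modules_text)


# Illustrative sample, not captured from a real host.
sample = (
    "algif_aead 16384 0 - Live 0x0000000000000000\n"
    "af_alg 32768 1 algif_aead, Live 0x0000000000000000\n"
)
print(is_module_loaded("algif_aead", sample))
```

A check like this only reports the currently loaded state. It does not show whether the module file is still on disk and could be loaded again, which is why the summary above mentions removing the file as well.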
DeepSeek-V4 Pro, a 1.6T-parameter MoE reasoning model, is now available on Together AI with a 512K context window, controllable reasoning modes, and cached-input pricing for long-context workloads like code agents, document intelligence, and research synthesis.
1.6T-parameter MoE with 49B activated parameters, 512K context on Together AI (model supports 1M)
Three reasoning modes: Non-Think, Think High, Think Max to match effort to task
NVIDIA Nemotron 3 Nano Omni is now on Together AI: a single open model that reasons across video, images, audio, and text, built for agentic workloads at scale.
Nemotron 3 Nano Omni is a single open model for multimodal reasoning, using a Mamba-Transformer MoE architecture. It activates only ~3B parameters per token. It supports up to 256K tokens of shared context across modalities.
Together AI's optimizations, including FlashAttention-4, deliver high-throughput inference with low latency.
Rollout is the silent bottleneck in RL post-training. DAS fixes it with adaptive speculative decoding — up to 50% faster, zero degradation in reward quality.
DAS reduces RL rollout time by up to 50% without affecting reward quality.
It uses an adaptive suffix tree drafter that self-evolves from rollout history.
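To make the drafter idea concrete, here is a heavily simplified sketch. It is not the DAS implementation: it replaces the suffix tree with a dictionary keyed by recent-token contexts, and `SuffixDrafter`, `observe`, `draft`, and `max_context` are hypothetical names. It shows the general pattern of learning from past rollouts and proposing draft tokens by longest matching suffix, which a target model would then verify.

```python
from collections import Counter, defaultdict


class SuffixDrafter:
    """Toy drafter: maps recent-token contexts to counts of the token that followed."""

    def __init__(self, max_context=4):
        self.max_context = max_context
        self.table = defaultdict(Counter)

    def observe(self, tokens):
        # Record every context of length 1..max_context that precedes each token.
        for i in range(1, len(tokens)):
            for n in range(1, self.max_context + 1):
                if i - n < 0:
                    break
                self.table[tuple(tokens[i - n:i])][tokens[i]] += 1

    def _next(self, context):
        # Prefer the longest context seen before, then back off to shorter ones.
        for n in range(min(self.max_context, len(context)), 0, -1):
            counts = self.table.get(tuple(context[-n:]))
            if counts:
                return counts.most_common(1)[0][0]
        return None

    def draft(self, prefix, num_tokens):
        # Propose up to num_tokens tokens; stop early when no context matches.
        context = list(prefix)
        out = []
        for _ in range(num_tokens):
            nxt = self._next(context)
            if nxt is None:
                break
            out.append(nxt)
            context.append(nxt)
        return out


drafter = SuffixDrafter()
drafter.observe([1, 2, 3, 4, 5])   # a finished rollout, as token ids
print(drafter.draft([1, 2], 3))    # [3, 4, 5]
```

Because `observe` can be called after every finished rollout, the table keeps adapting to the policy's recent outputs, which is the property the summary describes as self-evolving.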
Multi-tenant GPU clusters let AI-native companies share compute capacity across teams without sacrificing isolation or control. This guide covers core design principles, common failure modes, and how Together AI implements multi-tenancy in practice.
Multi-tenant GPU clusters pool capacity while providing dedicated nodes, storage, and self-serve scheduling per team.
Three core requirements: pooled capacity, tenant isolation, and self-serve access.
Parcae is a stable looped language model that matches the quality of a Transformer twice its size — a 770M model reaching 1.3B-level performance. We introduce the first scaling laws for looping and show that increasing recurrence, not just data, is a compute-efficient path to better performance.
Parcae is a stable looped architecture with clean, predictable training.
A 770M Parcae model achieves performance comparable to a 1.3B Transformer, halving the parameter count.
EinsteinArena is a platform where AI agents collaborate and compete on open math problems. Agents there have already set 11 new state-of-the-art results, including pushing the kissing number lower bound in dimension 11 from 593 to 604.
AI agents collaborate on EinsteinArena to solve open math problems.
Agents achieved 11 new SOTA results, including a major jump in the kissing number lower bound.
AI-native companies need infrastructure built for models, not legacy workloads. Learn what defines an AI Native Cloud and why it matters for the next platform shift.
AI-native companies need infrastructure built for model-centric workflows, not legacy web apps.
Traditional clouds optimized for CPU workloads cannot meet the GPU-intensive, rapidly iterating needs of AI.