AI News Hub (live)
Public articles: 21 · Collected articles: 21 · Trust: 88 · Refresh: 5 min
Health: Healthy · Source type: Official · Full-text rights: Official full text · Last ingested: 2026-06-23 · ID: together-ai-blog · Status: Enabled

Official source; confirm reuse terms before enabling full body display.

Latest public articles

ParallelKernelBench: Frontier LLMs can't write fast multi-GPU kernels (yet)

ParallelKernelBench tests whether LLMs can write fast multi-GPU CUDA kernels across 87 real workloads. The best model solves under a third, but a few generated kernels beat any public implementation.

  • ParallelKernelBench (PKB) includes 87 multi-GPU kernel generation problems from real codebases.
  • The best frontier model (GPT-5.5) solves under a third of the problems in the zero-shot setting, with only 22 running faster than the baseline.

Kimi K2.7 Code vs Claude Fable 5: Landing pages that cost 94% less

We generated 12 landing pages with Kimi K2.7 Code and Claude Fable 5. Kimi cost 94% less and scored within a few points on every page. Open-source models are not only cheaper but genuinely competitive on quality, and the gap is closing fast.

  • Kimi K2.7 Code costs about 94% less than Claude Fable 5 for generating landing pages.
  • Quality scores show only a small gap between Kimi and Fable, and the gap is smallest when a design inspiration MCP is used.

Building trust in enterprise AI: Together AI earns ISO 27001:2022 certification

Together AI has achieved ISO 27001:2022 certification from A-LIGN, validating its information security management system for enterprise-grade AI workloads, complementing existing SOC 2 controls.

  • ISO 27001:2022 certification awarded by A-LIGN Compliance and Security
  • Scope covers global platform, corporate HQ, and third-party data centers

Serving MiniMax-M3 for efficient inference: Unlocking 1M-Token Context and Multimodality Without Regrets

Together AI optimizes MiniMax M3 serving with KV-block-major sparse attention, paged MSA decode, optimized index scoring, and a Rust-based multimodal gateway, achieving 81–125% throughput improvements across concurrency levels.

  • MiniMax M3 combines coding, agentic workflows, and multimodal reasoning with a 1M-token context window.
  • Together AI's kernel team developed KV-block-major sparse attention and integrated MSA with paged attention.

How Together AI built the world’s fastest speech-to-text stack

Together AI built the fastest speech-to-text stack on Artificial Analysis by treating ASR as a full-path systems problem, not just a GPU inference problem. This article details optimizations including TensorRT multi-profile encoders, conditional CUDA graphs, shared memory, evented I/O, and gc.freeze() to eliminate tail latency.

  • Together AI achieved the fastest STT by optimizing the entire system path, not just GPU inference.
  • Key techniques: TensorRT multi-profile encoders, conditional CUDA graphs, zero-copy shared memory, and evented I/O.
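The summary mentions gc.freeze() as one way to cut tail latency. A minimal Python sketch of that general pattern follows: warm up long-lived state, collect once, then freeze so later garbage-collection passes skip those objects. The helper names and the placeholder state are hypothetical and are not taken from the article; only `gc.collect`, `gc.freeze`, and `gc.get_freeze_count` are real standard-library calls.

```python
import gc


def load_long_lived_state(n=10_000):
    """Hypothetical stand-in for startup work (model handles, caches, configs)."""
    # Lists are always tracked by the cyclic collector, so each one is
    # an object that later collections would otherwise have to scan.
    return [[i] for i in range(n)]


def freeze_after_warmup():
    """Collect startup garbage once, then move survivors out of GC's view."""
    gc.collect()   # drop garbage created while starting up
    gc.freeze()    # move all currently tracked objects to the permanent generation
    return gc.get_freeze_count()


state = load_long_lived_state()
frozen = freeze_after_warmup()
print(f"objects in permanent generation: {frozen}")
```

Frozen objects are ignored by subsequent collections, so request-time collections only scan objects allocated after the freeze. `gc.unfreeze()` reverses the step if the long-lived state needs to be released.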

Benchmarking inference at scale: coding agents

On a production coding agent workload, Together Inference Engine delivers 31% more TPS than the next fastest OSS engine on the same hardware, and maintains 2× better TTFT at saturation. The gains come from full-stack optimization: ThunderMLA, custom kernel rewrites, and end-to-end profiling on real traffic.

  • Real-world inference benchmarks for coding agents: 31% more TPS than TensorRT-LLM, 2× better TTFT at saturation, and 76% lower cost than Claude Opus 4.6.
  • Full-stack optimization includes ThunderMLA fused kernel, custom kernel rewrites, and end-to-end profiling.

Together AI and Pearl Research Labs Team Up to Reduce the Cost of AI Inference

Together AI partners with Pearl Research Labs to launch a discounted Pearl-powered inference endpoint for Gemma-4-31B-it-pearl, using Proof of Useful Work to turn AI workloads into crypto emissions.

  • Together AI partners with Pearl Research Labs to offer a discounted inference endpoint.
  • The endpoint uses Proof of Useful Work to mine cryptocurrency while performing AI inference.

Violin: An open-source video translation skill that breaks language barriers

Violin is an open-source AI video translation tool combining speech recognition, LLM translation, and text-to-speech to make video content accessible across languages. It offers a web app, CLI, and agent skills, featuring a video-aware chat assistant and personalized voice selection. Built with Together API using models like Whisper, DeepSeek, and Cartesia, it's released under the MIT license.

  • Violin integrates ASR, LLM translation, and TTS for open-source video translation.
  • Supports web app, CLI, and agent skills for diverse users.

Deploy and inference any model from HuggingFace

Learn how to deploy any HuggingFace model in one session using Goose and Together's Dedicated Container Inference. Skip the setup complexity — one prompt gets your model running in a production-grade GPU environment on release day.

  • Use Goose and Together's Dedicated Container Inference to deploy models with zero lag on release day.
  • The author deployed Netflix's void-model in a single session with one prompt.

Serving DeepSeek-V4: why million-token context is an inference systems problem

DeepSeek-V4's hybrid attention design (CSA, HCA, SWA) compresses KV cache, turning million-token context from a model challenge into a serving-systems problem. Together AI's early bring-up on NVIDIA HGX B200 reveals how cache policy, prefix caching, and endpoint profiles impact long-context workloads.

  • DeepSeek-V4's compressed sparse attention (CSA) and heavily compressed attention (HCA) reduce KV cache size, but the inference engine must manage multiple cache layouts.
  • Sliding window attention (SWA) becomes a bottleneck at long context, requiring careful storage strategy.

Foundational research powering efficient inference at scale

As AI moves from research to production, the challenge for AI-native teams shifts from building models to running them — efficiently, reliably, and at scale. Inference costs account for 80-90% of total lifetime cost of a production AI system. Together AI's research (FlashAttention-4, ATLAS) and full-stack optimization enable efficient inference, improving unit economics for customers.

  • Inference costs dominate AI system economics, comprising 80-90% of total lifetime cost.
  • Together AI introduces FlashAttention-4 (up to 1.3× faster than cuDNN) and ATLAS (adaptive speculative decoding for 4× faster inference).

Announcing Together AI and Adaption Partnership

Together AI and Adaption partner to bring Together Fine-Tuning natively into Adaptive Data, helping teams optimize datasets, run fine-tuning, evaluate results, and deploy stronger open models.

  • Together AI partners with Adaption to integrate fine-tuning into Adaptive Data.
  • The partnership aims to simplify the workflow from data optimization to model deployment.

From 732 bytes to nowhere: shutting down Copy Fail in production

Together AI details its rapid response to the Linux kernel vulnerability Copy Fail (CVE-2026-31431), which gave local unprivileged users a precise 4-byte write primitive via the AF_ALG interface, leading to privilege escalation. The team mitigated it by unloading the vulnerable kernel module, rolling out patches, and enhancing detection to keep its AI infrastructure secure.

  • Copy Fail (CVE-2026-31431) is a logic bug in the Linux kernel's crypto subsystem allowing precise 4-byte writes to any readable file's page cache.
  • Together AI unloaded the algif_aead module and removed its file within hours, blocking exploits without reboot.
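A fleet-wide mitigation like this is usually paired with a check that the module is really gone. The Python sketch below shows one way to test whether algif_aead is still loaded by parsing the text of /proc/modules, where the first whitespace-separated field of each line is the module name. The function names and the sample text are illustrative assumptions, not the team's actual tooling.

```python
from pathlib import Path


def loaded_modules(proc_modules_text):
    """Return the set of module names listed in /proc/modules-style text."""
    names = set()
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields:
            names.add(fields[0])  # first field is the module name
    return names


def is_module_loaded(name, proc_modules_text):
    """True if the named kernel module appears in the given listing."""
    return name in loaded_modules(proc_modules_text)


def read_proc_modules(path="/proc/modules"):
    """Read the live module list (Linux only)."""
    return Path(path).read_text()


# Illustrative sample in /proc/modules format; not captured from a real host.
SAMPLE = """\
algif_aead 16384 0 - Live 0x0000000000000000
af_alg 32768 1 algif_aead, Live 0x0000000000000000
"""

print(is_module_loaded("algif_aead", SAMPLE))
```

On a Linux host, `is_module_loaded("algif_aead", read_proc_modules())` runs the same check against the live kernel.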

DeepSeek-V4 Pro now available on Together AI

DeepSeek-V4 Pro, a 1.6T-parameter MoE reasoning model, is now available on Together AI with a 512K context window, controllable reasoning modes, and cached-input pricing for long-context workloads like code agents, document intelligence, and research synthesis.

  • 1.6T-parameter MoE with 49B activated parameters, 512K context on Together AI (model supports 1M)
  • Three reasoning modes: Non-Think, Think High, Think Max to match effort to task

Together AI Brings NVIDIA Nemotron 3 Nano Omni to Developers on Day 0

NVIDIA Nemotron 3 Nano Omni is now on Together AI: a single open model that reasons across video, images, audio, and text, built for agentic workloads at scale.

  • Nemotron 3 Nano Omni is a single open model for multimodal reasoning, using a Mamba-Transformer MoE architecture. It activates only ~3B parameters per token. It supports up to 256K tokens of shared context across modalities.
  • Together AI's optimizations, including FlashAttention-4, deliver high-throughput inference with low latency.

Accelerate RL rollouts by up to 50% with distribution-aware speculative decoding

Rollout is the silent bottleneck in RL post-training. DAS fixes it with adaptive speculative decoding — up to 50% faster, zero degradation in reward quality.

  • DAS reduces RL rollout time by up to 50% without affecting reward quality.
  • It uses an adaptive suffix tree drafter that self-evolves from rollout history.
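To illustrate the idea of drafting from rollout history, here is a toy Python sketch. It is not the DAS implementation: it swaps the suffix tree for a plain n-gram lookup table, uses greedy exact-match verification, and every name in it (`SuffixDrafter`, `observe`, `draft`, `count_accepted`) is hypothetical.

```python
from collections import defaultdict


class SuffixDrafter:
    """Toy drafter: predicts the next token from the longest matching suffix."""

    def __init__(self, max_suffix_len=4):
        self.max_suffix_len = max_suffix_len
        # context n-gram (tuple of tokens) -> {next_token: count}
        self.table = defaultdict(lambda: defaultdict(int))

    def observe(self, tokens):
        """Add one finished rollout to the history."""
        for i in range(1, len(tokens)):
            for n in range(1, self.max_suffix_len + 1):
                if i - n < 0:
                    break
                self.table[tuple(tokens[i - n:i])][tokens[i]] += 1

    def _predict(self, ctx):
        # Prefer the longest suffix of the context seen in history.
        for n in range(min(self.max_suffix_len, len(ctx)), 0, -1):
            counts = self.table.get(tuple(ctx[-n:]))
            if counts:
                return max(counts, key=counts.get)
        return None

    def draft(self, context, num_tokens):
        """Propose up to num_tokens continuation tokens."""
        ctx, out = list(context), []
        for _ in range(num_tokens):
            nxt = self._predict(ctx)
            if nxt is None:
                break
            out.append(nxt)
            ctx.append(nxt)
        return out


def count_accepted(drafted, target):
    """Greedy verification: keep the drafted prefix the target model agrees with."""
    accepted = 0
    for d, t in zip(drafted, target):
        if d != t:
            break
        accepted += 1
    return accepted


drafter = SuffixDrafter()
drafter.observe([1, 2, 3, 4, 5])
print(drafter.draft([1, 2], 3))
```

Because drafted tokens are kept only where they match what the target model would have produced, the output is unchanged and only the time to produce it shrinks. Calling `observe` on each new rollout is the sense in which such a drafter adapts to history.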

Capacity without conflict: A guide to multi-tenant GPU cluster design for AI-native teams

Multi-tenant GPU clusters let AI-native companies share compute capacity across teams without sacrificing isolation or control. This guide covers core design principles, common failure modes, and how Together AI implements multi-tenancy in practice.

  • Multi-tenant GPU clusters pool capacity while providing dedicated nodes, storage, and self-serve scheduling per team.
  • Three core requirements: pooled capacity, tenant isolation, and self-serve access.

Parcae: Doing more with fewer parameters using stable looped models

Parcae is a stable looped language model that matches the quality of a Transformer twice its size — a 770M model reaching 1.3B-level performance. We introduce the first scaling laws for looping and show that increasing recurrence, not just data, is a compute-efficient path to better performance.

  • Parcae is a stable looped architecture with clean, predictable training.
  • A 770M Parcae model achieves performance comparable to a 1.3B Transformer, with roughly 40% fewer parameters.

EinsteinArena: Harnessing the collective intelligence of agents in the wild to advance science

EinsteinArena is a platform where AI agents collaborate and compete on open math problems. AI agents on EinsteinArena have already set 11 new state-of-the-art results on open math problems — including pushing the kissing number lower bound in dimension 11 from 593 to 604.

  • AI agents collaborate on EinsteinArena to solve open math problems.
  • Agents have achieved 11 new SOTA results, including raising the kissing number lower bound in dimension 11 from 593 to 604.

What is an AI Native Cloud?

AI-native companies need infrastructure built for models, not legacy workloads. Learn what defines an AI Native Cloud and why it matters for the next platform shift.

  • AI-native companies need infrastructure built for model-centric workflows, not legacy web apps.
  • Traditional clouds optimized for CPU workloads cannot meet the GPU-intensive, rapidly iterating needs of AI.
