AI News Hub

Inference Cost updates

Stereo Vision-Based Fall Prediction and Detection using Human Pose Estimation on the AMD Kria K26 SOM

This paper presents a portable, low-power, battery-operated vision-based fall prediction and detection system using human pose estimation on an AMD Kria K26 SOM. The system uses an Intel RealSense D455 camera and a three-stage pipeline (quantized YOLOX, A2J, and CNN) to achieve real-time, privacy-preserving fall detection on the edge. Results show 4.5 FPS throughput with 75.85% classification accuracy.

  • Privacy-preserving fall detection system implemented on AMD Kria K26 edge device
  • Three-stage pipeline: YOLOX for human detection, A2J for joint estimation, CNN for fall classification

Arbor: Tree Search as a Cognition Layer for Autonomous Agents

Arbor is a multi-agent framework introducing structured tree search as a cognition layer for autonomous agents in large stateful action spaces. Validated on full-stack LLM inference optimization, it achieves up to 193% Pareto improvement in throughput-latency over vendor baselines, with a critic agent ensuring stability.

  • Arbor uses tree search as shared working memory across agents for coordinated optimization.
  • Achieves up to 193% throughput-latency Pareto improvement on full-stack LLM inference, hardware-agnostic.

Enabling KV Caching of Shared Prefix for Diffusion Language Models

In Diffusion Language Models (DLMs), bidirectional attention causes existing KV caching methods to fail, driving accuracy to near zero. The proposed bidirectional prefix caching (bicache) dynamically identifies safe layer depths at which shared-prefix KVs can be reused, improving throughput by 36.3%-98.3% with only 0-1.8% accuracy degradation.

  • Existing LLM prefix caching fails in DLMs due to bidirectional attention that dynamically alters context and KVs.
  • Bicache leverages the observation that shared prefix KVs remain stable in shallow layers, with depth determined by the fraction of shared prefix tokens.

Xiaomi MiMo and TileRT Push a 1-Trillion-Parameter Model Past 1000 Tokens Per Second on Commodity GPUs

Xiaomi's MiMo team, with TileRT, released MiMo-V2.5-Pro-UltraSpeed, a serving mode for the MiMo-V2.5-Pro model. It decodes over 1000 tokens per second on a 1-trillion-parameter model using a single 8-GPU commodity node. The speedup comes from FP4 quantization, DFlash speculative decoding, and the TileRT runtime. API trial runs June 9–23, 2026.

  • 1T-parameter MoE model achieves 1000+ tokens/sec on commodity GPUs
  • Three coordinated techniques: FP4 quantization, DFlash speculative decoding, TileRT runtime

FAIR-Calib: Frontier-Aware Instability-Reweighted Calibration for Post-Training Quantization of Diffusion Large Language Models

Diffusion Large Language Models (dLLMs) refine tokens iteratively but commit them irreversibly, causing a "stability lag" where early decisions remain fragile. Post-Training Quantization (PTQ) error easily flips these borderline decisions at the write frontier, locking them in. FAIR-Calib, a two-stage PTQ framework, probes a full-precision teacher for a position prior and performs off-policy layer-wise calibration with a reweighted hidden-state MSE, protecting fragile frontier states without expensive end-to-end rollouts. Theoretically justified as a surrogate for output KL divergence, FAIR-Calib outperforms baselines on LLaDA and Dream (W4A4), reducing frontier flips and post-commit mismatches.

  • Diffusion LLMs suffer from stability lag where early token decisions are fragile to quantization error
  • FAIR-Calib introduces a two-stage PTQ framework with frontier-aware instability reweighting

Accelerated Fourier SAT (AFSAT): Fully Realising a GPU-based Symmetric Pseudo-Boolean SAT Solver

We present Accelerated Fourier SAT (AFSAT), a GPU-accelerated solver for pseudo-Boolean satisfiability based on continuous local search (CLS). AFSAT realises the proof-of-concept approach, FastFourierSAT, into a fully-engineered solver supporting any heterogeneous mixture of symmetric constraint types and lengths within a single problem instance. Using the JAX compiler, AFSAT leverages pure function composition, automatic vectorisation, automatic differentiation, and just-in-time (JIT) compilation to perform massively parallel CLS across batches of candidate assignments. We demonstrate substantially improved numerical stability, runtime performance, and memory efficiency over the proof-of-concept. We achieve this by identifying and addressing limitations that arise from memory latency and floating-point representation, and by leveraging automatic parallelisation and compact representations. The inherent representational and stability limitations of floating point are partially addressed by a tailored discrete Fourier transform implementation. We achieve near-linear throughput when scaling to multiple accelerators via JAX array sharding.

  • AFSAT is a GPU-accelerated pseudo-Boolean SAT solver using continuous local search, improving upon FastFourierSAT.
  • It leverages JAX for massive parallelization via function composition, vectorization, differentiation, and JIT compilation.

Show HN: Best setup local LLM found for a 5090 (llama.cpp fork + turboquant)

This article details the configuration and memory calibration required to run the Qwen 3.6 35B MoE model at a 450,000 token context window on a single 32GB VRAM GPU (NVIDIA RTX 5090) using llama.cpp with TurboQuant and YaRN scaling. It covers model selection, quantization trade-offs, KV cache quantization, RoPE scaling, multimodal setup, replication guide, VRAM lifecycle management, and performance evaluation.

  • Run Qwen3.6-35B-A3B-Q6_K on a single RTX 5090 with 450K context using llama.cpp TurboQuant fork and YaRN scaling.
  • Achieve 450K context by compressing the KV cache to 3-bit (turbo3) and extending RoPE beyond the native 262K with YaRN, at the cost of perplexity and retrieval accuracy.

Roblox Released the Biggest AI World Model in Gaming. Everyone Hates It

Roblox acquired Morpheus AI and formed Roblox Labs, then quietly released a world model game called World Research Station which received a 3% rating due to poor performance, high latency, and glitches. The article criticizes Roblox for rushing out immature technology that may harm the field's reputation.

  • Roblox acquired Morpheus AI and formed Roblox Labs to showcase AI world models.
  • Quietly released World Research Station, which suffered from terrible performance and earned a 3% rating.

Google DeepMind Releases Gemma 4 QAT Checkpoints: Q4_0 and a New Mobile Format Cut On-Device Memory

Google DeepMind released Quantization-Aware Training checkpoints for Gemma 4, targeting edge devices and consumer GPUs. This comparison of BF16, Q4_0 QAT, and the new mobile QAT format focuses on memory footprint, quality preservation, and deployment suitability using published data.

  • Q4_0 QAT reduces E2B memory from 9.6 GB (BF16) to 3.2 GB, and E4B from 15 GB to 5 GB.
  • The new mobile QAT format brings E2B to ~1 GB; text-only goes under 1 GB.
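
As a quick check on the figures above, the following sketch computes how much smaller each Q4_0 QAT checkpoint is than its BF16 counterpart. The numbers are copied from the summary; the function names are illustrative only.

```python
# Memory footprints in GB, as quoted in the summary above.
FOOTPRINTS_GB = {
    "E2B": {"bf16": 9.6, "q4_0_qat": 3.2},
    "E4B": {"bf16": 15.0, "q4_0_qat": 5.0},
}


def reduction_factor(before_gb: float, after_gb: float) -> float:
    """How many times smaller the quantized checkpoint is."""
    return before_gb / after_gb


def percent_saved(before_gb: float, after_gb: float) -> float:
    """Share of the original footprint that is no longer needed."""
    return 100.0 * (1.0 - after_gb / before_gb)


for name, sizes in FOOTPRINTS_GB.items():
    factor = reduction_factor(sizes["bf16"], sizes["q4_0_qat"])
    saved = percent_saved(sizes["bf16"], sizes["q4_0_qat"])
    print(f"{name}: {factor:.1f}x smaller, {saved:.0f}% saved")
```

Both models come out at roughly a 3x reduction, which is less than the 4x that a naive 16-bit to 4-bit ratio would suggest.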

Gemma 4 QAT models: Optimizing compression for mobile and laptop efficiency

Google releases new Gemma 4 checkpoints optimized with Quantization-Aware Training (QAT) to reduce memory usage and enable local deployment on edge devices and consumer GPUs. The models include a custom mobile quantization format that cuts memory footprint to 1GB for the E2B model.

  • QAT integration during training minimizes quality loss from compression.
  • Custom mobile quantization schema includes static activations, channel-wise quantization, targeted 2-bit quantization, and embedding/KV cache optimization.

NVIDIA AI Releases Dynamo Snapshot: A CRIU-Based Fast Startup System for AI Inference on Kubernetes

NVIDIA introduces Dynamo Snapshot, a checkpoint/restore approach that uses CRIU and cuda-checkpoint to reduce cold-start latency for AI inference workloads on Kubernetes, cutting startup times from minutes to seconds. Optimizations include KV cache unmapping, parallel memfd restore, Linux native AIO, and GPU Memory Service.

  • Dynamo Snapshot eliminates cold-start delays by checkpointing and restoring inference worker state on Kubernetes.
  • Optimizations include KV cache unmapping, parallel memfd restore, Linux native AIO, and GPU Memory Service (GMS).

LANTERN: Layered Archival and Temporal Episodic Retrieval Network for Long-Context LLM Conversations

LANTERN is a lightweight memory layer that proactively archives conversation turns and restores details after compaction via hybrid retrieval, requiring zero LLM calls and <25ms latency per turn. It recovers 78.3% of lost facts, outperforming MemGPT, and improves accuracy of production LLMs by 8.4 percentage points on average.

  • LANTERN is a zero LLM-call memory layer with <25ms latency per turn, recovering lost details after context compaction.
  • On 94 real conversations, LANTERN-Rerank recovers 78.3% of verifiable facts, outperforming MemGPT's 72.4%.

NVIDIA AI Releases Nemotron 3 Ultra: An Open 550B Mixture-of-Experts Hybrid Mamba-Transformer for Long-Running Agents

NVIDIA has released Nemotron 3 Ultra, a 550B total (55B active) open Mixture-of-Experts hybrid Mamba-Transformer for long-running agents. It pairs a 1M-token context with up to ~6x higher inference throughput than comparable open LLMs at on-par accuracy, and ships with open weights, training data, and recipes under OpenMDW-1.1.

  • Employs a hybrid Mamba-Attention architecture; Mamba layers scale sub-quadratically, while attention layers ensure precise recall.
  • 550B total parameters, only 55B active per token; utilizes LatentMoE and Multi-Token Prediction for efficiency.

Nemotron 3.5 Content Safety: Customizable Multimodal Safety for Global Enterprise AI

NVIDIA releases Nemotron 3.5 Content Safety, a unified model combining multimodal input, multilingual coverage, custom enterprise policy enforcement, and auditable reasoning for content safety. Built on Google Gemma 3 4B IT and fine-tuned with LoRA, it supports explicit training in 12 languages with zero-shot generalization to ~140 languages. New features include custom policy enforcement via natural language specifications and a THINK mode for auditable step-by-step reasoning. The model achieves ~85% average accuracy across multiple multilingual and multimodal safety benchmarks while maintaining a compact 4B-parameter size and low latency. NVIDIA also releases a safety dataset with multimodal, multilingual safety reasoning traces.

  • Nemotron 3.5 unifies multimodal input, multilingual coverage, custom policies, and auditable reasoning.
  • Explicit training in 12 languages with zero-shot generalization to ~140 languages via Gemma 3 base.

Mate Security’s Asaf Wiener made every backend engineer a model router. He’s right to.

When AI inference costs threatened Mate Security's runway, CEO Asaf Wiener didn't just cut costs—he restructured the company so that every backend engineer owns model selection, evaluation, and routing for their workloads. This shift from cloud-era opacity to workload-level cost visibility has enabled quality-cost optimization, with open-source models sometimes outperforming frontier APIs on specific tasks. Wiener argues that an AI-native company's only structural advantage is shipping against the best model available that day, enabled by an 'execution mode' culture that avoids legal-policy review cycles and hires for adaptability.

  • Wiener broke down AI inference cost into ~10 sub-lines for workload-level visibility, projecting per-feature cost before shipping.
  • Every backend engineer at Mate runs evals on their workloads, choosing models based on quality and cost, updated continuously.

How to Fine-Tune Nemotron 3.5 ASR for Your Language, Domain, or Accent

NVIDIA introduces Nemotron 3.5 ASR, a 600M-parameter streaming multilingual speech-to-text model supporting 40 language-locales with low latency, high accuracy, and built-in punctuation and capitalization. The article details how to fine-tune the model for specific languages, domains, or accents, showing significant WER reductions for Greek and Bulgarian as examples.

  • Nemotron 3.5 ASR is a single-checkpoint streaming multilingual model supporting 40 language-locales.
  • It uses a Cache-Aware FastConformer-RNNT architecture for low latency and high accuracy.

LiftQuant: Continuous Bit-Width LLM via Dimensional Lifting and Projection

arXiv:2606.04050v1. Existing quantization methods are fundamentally limited by rigid, integer-based bit-widths (e.g., 2-bit or 3-bit), resulting in a "deployment gap" where Large Language Models cannot be optimally fitted to specific memory budgets. To bridge this gap, we introduce LiftQuant, a novel framework that enables continuous bit-width control for true Pareto-optimal deployment. The core innovation is a "lift-then-project" mechanism, which approximates low-dimensional weight vectors by projecting a simple 1-bit lattice from a higher-dimensional "lifted" space. Crucially, the effective bit-width is determined simply by the ratio of the lifted dimension to the original dimension, which allows the bit-width to be tuned quasi-continuously because the dimension is a flexible structural parameter. This projection generates a structured yet non-uniform codebook, capturing the expressive power of Vector Quantization (VQ). While offering benefits over VQ, LiftQuant's decoding path relies solely on linear transformations and 1-bit uniform quantizers, retaining a hardware-friendly nature. This flexibility is transformative: LiftQuant enables a 70B LLM to be compressed to 2.4 bits to precisely fit a 24GB GPU, where its performance significantly surpasses state-of-the-art 2-bit models fitted on the same device. Our code and checkpoints are available at https://github.com/Heliulu/LiftQuant.

  • Existing quantization methods are limited to integer bit-widths, causing a deployment gap.
  • LiftQuant introduces a lift-then-project mechanism for continuous bit-width control.
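
The abstract's two numeric claims (bit-width as a ratio of dimensions, and a 70B model at 2.4 bits fitting a 24GB GPU) can be checked with a little arithmetic. This is a back-of-the-envelope sketch, not LiftQuant's code: the dimensions 12 and 5 are hypothetical values chosen to give 2.4 bits, and the estimate counts weights only, ignoring projection matrices, scales, activations, and KV cache.

```python
def effective_bits(lifted_dim: int, original_dim: int) -> float:
    """Bits per weight when each lifted coordinate is stored as 1 bit."""
    return lifted_dim / original_dim


def weight_gb(n_params: float, bits_per_weight: float) -> float:
    """Weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Hypothetical dimensions: 12 one-bit coordinates describe 5 weights.
bits = effective_bits(lifted_dim=12, original_dim=5)

for candidate in (2.0, bits, 3.0):
    size = weight_gb(70e9, candidate)
    verdict = "fits" if size <= 24 else "does not fit"
    print(f"{candidate:.1f} bits -> {size:.2f} GB, {verdict} in 24 GB")
```

Under these assumptions a 70B model needs about 21 GB at 2.4 bits, while 3 bits would exceed 24 GB and 2 bits would leave several gigabytes unused, which is the "deployment gap" the abstract describes.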

Do Transformers Need Three Projections? Systematic Study of QKV Variants

A systematic study of projection sharing in the query-key-value (QKV) attention mechanism of transformers, showing that sharing key and value projections (Q-K=V) reduces KV cache by 50% with only 3.1% perplexity degradation. Combining with grouped-query attention (GQA) or multi-query attention (MQA) achieves 87.5% and 96.9% cache reduction, respectively, enabling practical on-device inference. Experiments span synthetic tasks, vision, and language modeling. Code is publicly available.

  • Systematically evaluates three projection sharing constraints: Q-K=V, Q=K-V, and Q=K=V.
  • Q-K=V achieves 50% cache reduction with only 3.1% perplexity increase in language modeling.
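
The cache-reduction percentages above follow from counting how many tensors are cached per token. The sketch below reproduces them under two assumptions that are not stated in the summary: 16 attention heads, and 4 KV heads for the GQA case. These values were picked because they match the quoted 87.5% and 96.9% figures.

```python
def kv_cache_reduction(n_heads: int, n_kv_heads: int, share_kv: bool) -> float:
    """Fraction of KV cache saved relative to standard multi-head attention.

    The baseline caches a separate K and V tensor for every head.
    Sharing K and V (K=V) means one cached tensor per KV head instead of two.
    """
    tensors_per_kv_head = 1 if share_kv else 2
    baseline = 2 * n_heads
    return 1.0 - (tensors_per_kv_head * n_kv_heads) / baseline


N_HEADS = 16  # assumed head count, not taken from the study

print(kv_cache_reduction(N_HEADS, N_HEADS, share_kv=True))  # K=V only
print(kv_cache_reduction(N_HEADS, 4, share_kv=True))        # K=V with GQA, 4 KV heads
print(kv_cache_reduction(N_HEADS, 1, share_kv=True))        # K=V with MQA
```

With these settings the three calls give 0.5, 0.875, and 0.96875, matching the 50%, 87.5%, and 96.9% reported in the summary.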

Free vLLM Course: Inference, Compression, Benchmarks

DeepLearning.AI and Red Hat offer a free intermediate course on efficient LLM inference with vLLM, taught by Cedric Clyburn. The course covers quantization, serving with vLLM, and benchmarking, with 9 video lessons, 3 code examples, and a quiz.

  • Learn to apply quantization to reduce model memory footprint and measure accuracy tradeoffs
  • Serve models with vLLM using continuous batching, PagedAttention, and prefix caching

With Nvidia Groq 3, the Era of AI Inference Is (Probably) Here

Nvidia unveils the Groq 3 LPU, its first chip dedicated to AI inference, featuring an SRAM-based architecture for ultra-low latency. The chip, which incorporates technology licensed from Groq, works alongside Vera Rubin GPUs to optimize performance through inference disaggregation, signaling a shift toward inference-focused computing in the AI industry.

  • Nvidia announces Groq 3 LPU, its first inference-specific chip, using a linear data flow architecture with on-chip SRAM.
  • The chip achieves 150 TB/s memory bandwidth, seven times that of the Vera Rubin GPU, enabling low-latency token generation.

AURA: Action-Gated Memory for Robot Policies at Constant VRAM

AURA-Mem proposes a constant-size recurrent memory for robot policies that writes only when an observation would change the next action, drastically reducing memory writes while maintaining accuracy. It uses a learned gate trained on an action-error signal, giving a fixed 4,224-byte inference state instead of a growing KV cache. Experiments show matching success rates with up to 7x fewer writes.

  • AURA-Mem replaces KV-cache with a fixed-size (4,224 bytes) recurrent memory for robot policy inference.
  • Action-gating mechanism determines writes based on whether the observation affects future actions, reducing writes by 5-7x.

The Sequence Knowledge #870: Liquid Models and the Search for a Post-Transformer Architecture

This article examines the limitations of the Transformer architecture and introduces liquid models as a promising alternative for low-latency, private on-device intelligence.

  • Transformer's global attention leads to high memory and compute costs during inference.
  • Liquid models use dynamics instead of attention, offering efficiency for real-time and edge scenarios.

ART: Attention Run-time Termination for Efficient Large Language Model Decoding

Long-context decoding in LLMs is constrained by memory bandwidth for fetching KV cache. Existing methods prune keys before decoding, ignoring joint key-value dependence. ART is a lightweight run-time mechanism that tracks accumulated attention outputs and terminates KV block accesses when contributions become negligible. It is orthogonal to existing methods and achieves 20% higher throughput on LongBench with comparable accuracy.

  • ART is a lightweight run-time mechanism that dynamically terminates unnecessary KV cache accesses by tracking attention outputs.
  • It is orthogonal to existing key-based KV cache management methods and can be seamlessly integrated.

DAStatFormer: A Hybrid Multibranch Transformer with Statistical Feature Integration for DAS-Based Pattern Recognitions

DAStatFormer is a hybrid multibranch Transformer that uses statistical features from multiple domains to classify DAS events efficiently, achieving 99.4% accuracy with fewer parameters and lower inference cost.

  • Extracts 24 ANOVA-selected attributes per channel from temporal, waveform, and spectral domains.
  • Uses step-wise and channel-wise attention branches fused via adaptive gating.

BitsMoE: Efficient Spectral Energy-Guided Bit Allocation for MoE LLM Quantization

BitsMoE is an efficient quantization framework for Mixture-of-Experts (MoE) large language models. It uses SVD to decompose each MoE layer into a shared basis and expert-specific spectral factors, preserving the shared basis without quantization to maintain cross-expert structure. An integer linear programming formulation minimizes reconstruction loss under a fixed bit budget. Experiments show that BitsMoE significantly reduces accuracy degradation in ultra-low-bit regimes, achieving 12.3× quantization speedup, 27.83 percentage point average accuracy improvement, and 1.76× decoding speedup over GPTQ on Qwen3-30B-A3B-Base at 2 bits.

  • Proposes BitsMoE, which leverages SVD decomposition of MoE layers for fine-grained quantization.
  • Uses integer linear programming for activation-aware mixed-precision bit allocation to minimize reconstruction loss.

Serving MiniMax-M3 for efficient inference: Unlocking 1M-Token Context and Multimodality Without Regrets

Together AI optimizes MiniMax M3 serving with KV-block-major sparse attention, paged MSA decode, optimized index scoring, and a Rust-based multimodal gateway, achieving 81–125% throughput improvements across concurrency levels.

  • MiniMax M3 combines coding, agentic workflows, and multimodal reasoning with a 1M-token context window.
  • Together AI's kernel team developed KV-block-major sparse attention and integrated MSA with paged attention.

Accelerate LLM model loading and increase context windows with GPUDirect on Amazon FSx for Lustre and TurboQuant

This post explores how combining Amazon FSx for Lustre, NVIDIA GPUDirect Storage, and sharded parallel loading reduces cold-start time-to-first-token for large language models from minutes to seconds, and how TurboQuant KV cache significantly increases context window size.

  • CPU-based model loading is a cold-start bottleneck, taking 10–20 minutes for a 405B model.
  • FSx for Lustre with GPUDirect Storage enables direct GPU HBM loading via EFA, bypassing CPU.

Tether brings TurboQuant to QVAC SDK, its local AI engine

QVAC SDK 0.12.0 introduces TurboQuant, a KV-cache quantization algorithm that reduces context memory consumption by up to 5x, enabling full 262K-token contexts on consumer GPUs. It works without model retraining and is based on Google Research's ICLR 2026 paper.

  • TurboQuant compresses KV cache from 16-bit to ~3-bit with minimal accuracy loss.
  • Low-VRAM devices like RTX 5060 can now handle full 262K context.
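
To see why the bit-width of the KV cache decides whether a 262K-token context fits, the sketch below sizes the cache for one sequence at 16 bits and at 3 bits. The model shape (32 layers, 8 KV heads, head dimension 128) is hypothetical and not taken from the article, and the calculation ignores any per-block scales or metadata a real quantized format stores.

```python
def kv_cache_gib(tokens: int, n_layers: int, n_kv_heads: int,
                 head_dim: int, bits: int) -> float:
    """Size of cached keys plus values for one sequence, in GiB."""
    elements = 2 * n_layers * n_kv_heads * head_dim * tokens  # 2 = K and V
    return elements * bits / 8 / 2**30


# Hypothetical model shape, for illustration only.
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128)
context = 262_144

fp16 = kv_cache_gib(context, bits=16, **shape)
q3 = kv_cache_gib(context, bits=3, **shape)
print(f"16-bit: {fp16:.0f} GiB, 3-bit: {q3:.0f} GiB, ratio {fp16 / q3:.2f}x")
```

For this assumed shape the cache shrinks from 32 GiB to 6 GiB. The ideal ratio of 16/3 is slightly above the "up to 5x" quoted above, which would be consistent with some storage overhead in the real format.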

SANA-Streaming: Real-time Streaming Video Editing with Hybrid Diffusion Transformer

SANA-Streaming is a system-algorithm co-designed framework for high-resolution real-time streaming video editing on consumer GPUs. It features a Hybrid Diffusion Transformer, Cycle-Reverse Regularization, and efficient system co-design, achieving 24 FPS at 1280x704 on an RTX 5090. Experiments show significant improvements in temporal coherence and throughput over state-of-the-art.

  • Hybrid Diffusion Transformer uses softmax attention in select blocks to enhance local modeling while preserving linear layer efficiency.
  • Cycle-Reverse Regularization enforces temporal consistency via flow matching without requiring paired long videos.

Trajectory Releases a Concurrent Multi-LoRA Training Stack for Continual Learning, Reporting a 2.81× Experiment-Throughput Gain

Trajectory, working with UC Berkeley Sky Lab and Anyscale, built a concurrent multi-LoRA training stack for continual learning. It maps each RL experiment to a dedicated LoRA adapter on an always-hot engine, reporting a 2.81× end-to-end experiment-throughput gain over a single-tenant baseline with no reward regression. The code is open-sourced in NovaSky-AI/SkyRL.

  • Trajectory introduces C-LoRA, a concurrent multi-LoRA training stack achieving 2.81× experiment-throughput gain.
  • Each experiment uses a dedicated LoRA adapter on a warm engine, leveraging vLLM multi-LoRA inference for concurrency.

Where the AI Hardware Market Is: A Memory Problem Stack

This article analyzes the memory bottleneck in AI hardware, particularly during LLM inference. It covers approaches at the chip level (Groq, Cerebras, MatX, d-Matrix), inference engines (RadixArk, Inferact), KV cache infrastructure (TensorMesh/LMCache), and packaging/interconnect (CoWoS). The key insight: the market is a stack of memory problems, and durable companies need to own a control point that cannot be internalized elsewhere in the stack.

  • Modern GPU tensor throughput far outpaces HBM bandwidth, causing underutilization during decode
  • Solutions target memory at chip, engine, cache, and packaging levels

Overline

Overline is a Chrome extension that provides real-time AI captions and live translation for any browser video, with sub-second latency and no need for subtitles.

  • Real-time AI captions and live translation for browser videos
  • Works on YouTube, Netflix, Twitch, Zoom, and more

How Together AI built the world’s fastest speech-to-text stack

Together AI built the fastest speech-to-text stack on Artificial Analysis by treating ASR as a full-path systems problem, not just a GPU inference problem. This article details optimizations including TensorRT multi-profile encoders, conditional CUDA graphs, shared memory, evented I/O, and gc.freeze() to eliminate tail latency.

  • Together AI achieved fastest STT by optimizing the entire system path, not just GPU inference.
  • Key techniques: TensorRT multi-profile encoders, conditional CUDA graphs, zero-copy shared memory, and evented I/O.
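
Of the techniques listed, `gc.freeze()` is the easiest to show in isolation. The sketch below illustrates the general pattern of freezing long-lived start-up state so later garbage-collection passes skip it; `warm_up` is a hypothetical stand-in, and this is not Together AI's code.

```python
import gc


def warm_up() -> list:
    """Stand-in for loading models, caches, and other long-lived state."""
    return [[i] for i in range(10_000)]


state = warm_up()

gc.collect()   # clear garbage created during start-up
gc.freeze()    # move every surviving tracked object to the permanent generation
frozen = gc.get_freeze_count()
print(f"{frozen} objects excluded from future collections")

# ... the request-serving loop would run here ...
```

Because frozen objects are ignored by later collections, full GC passes triggered while serving have less to scan, which is one plausible way this call reduces tail latency. `gc.unfreeze()` reverses the effect.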

Training Azerbaijani language models on Amazon SageMaker AI

Azercell Telecom collaborated with the AWS Generative AI Innovation Center to build an Azerbaijani LLM on Amazon SageMaker AI, achieving 23% higher training throughput, 58% lower peak GPU memory, and 2× token efficiency via custom tokenizer, FSDP, and Liger Kernel optimizations.

  • Azercell developed a production-ready Azerbaijani LLM framework using Amazon SageMaker AI.
  • Custom tokenizer reduced tokens per word from 3.22 to 1.59, doubling encoding efficiency.

Perplexity AI Open-Sources Unigram Tokenizer That Achieves 5x Lower p50 Latency Than Hugging Face tokenizers Crate

Perplexity AI open-sourced a Rust reimplementation of their Unigram tokenizer, achieving 5x lower latency than Hugging Face's tokenizers crate and reducing CPU utilization by 5-6x in production. The optimizations include double-array trie, bitmap packing, and huge pages.

  • Perplexity AI rewrote the Unigram tokenizer in Rust, achieving 5x lower p50 latency vs Hugging Face tokenizers crate.
  • Three optimizations: double-array trie, bitmap and cache-line packing, and huge pages.
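
For readers unfamiliar with Unigram tokenization, the sketch below shows the core step: a Viterbi search for the most probable segmentation given per-token log-probabilities. It is a minimal pure-Python illustration with a made-up vocabulary, and it uses a plain dictionary lookup where the article describes a double-array trie. It is not Perplexity's implementation.

```python
import math


def unigram_tokenize(text: str, logp: dict[str, float]) -> list[str]:
    """Return the highest-probability segmentation of text under a unigram model."""
    n = len(text)
    max_len = max(map(len, logp))
    best = [-math.inf] * (n + 1)  # best[i]: best log-prob of text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last token
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in logp and best[start] + logp[piece] > best[end]:
                best[end] = best[start] + logp[piece]
                back[end] = start
    if best[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    tokens, i = [], n
    while i > 0:  # walk the back-pointers from the end of the text
        tokens.append(text[back[i]:i])
        i = back[i]
    return tokens[::-1]


# Hypothetical vocabulary with log-probabilities.
VOCAB = {"un": -2.0, "i": -3.0, "gram": -2.5, "uni": -2.2,
         "g": -4.0, "ram": -3.0, "unigram": -6.0}
print(unigram_tokenize("unigram", VOCAB))  # ['uni', 'gram']
```

The inner loop does a substring lookup for every candidate span, which is the kind of hot path a trie and cache-friendly layout would target.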

A Coding Guide to Implement a pgvector-Powered Semantic, Hybrid, Sparse, and Quantized Vector Search System

This tutorial builds a complete pgvector playground in Google Colab, covering installation, embedding creation, HNSW indexing, semantic search, filtered search, distance metric comparisons, half-precision storage, binary quantization, sparse vector search, hybrid retrieval, and vector aggregation, all using open-source tools without external API keys.

  • Set up PostgreSQL with pgvector extension in Google Colab from scratch.
  • Generate embeddings with SentenceTransformers and build HNSW indexes for efficient search.
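
Binary quantization, one of the topics the tutorial covers, can be illustrated without a database. The sketch below keeps only the sign of each embedding dimension and ranks documents by Hamming distance. The embeddings and function names are made up for illustration; this mirrors the idea rather than reproducing pgvector's API or the tutorial's code.

```python
def binary_quantize(vec: list[float]) -> list[int]:
    """Keep one bit per dimension: 1 if the value is positive, else 0."""
    return [1 if x > 0 else 0 for x in vec]


def hamming(a: list[int], b: list[int]) -> int:
    """Number of positions where two bit vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def search(query: list[float], docs: dict[str, list[float]], k: int) -> list[str]:
    """Return the k document names closest to the query in Hamming distance."""
    q = binary_quantize(query)
    ranked = sorted(docs, key=lambda name: hamming(q, binary_quantize(docs[name])))
    return ranked[:k]


# Toy 4-dimensional embeddings (hypothetical values).
DOCS = {
    "a": [0.9, -0.1, 0.3, -0.7],
    "b": [-0.5, 0.2, -0.3, 0.8],
    "c": [0.4, 0.6, 0.1, -0.2],
}
print(search([0.8, -0.2, 0.5, -0.9], DOCS, k=2))  # ['a', 'c']
```

A common design is to use the cheap binary pass to shortlist candidates and then re-rank the shortlist with the full-precision vectors.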

SCALE-COMM: Shared, Contrastively-Aligned Latent Embeddings for MARL Communication

SCALE-COMM is a self-supervised framework that decouples communication learning from policy optimization, learning compact, stable, and policy-relevant latent messages to improve coordination in multi-agent reinforcement learning. It outperforms existing methods on benchmarks and a realistic warehouse task, offering better stability, sample efficiency, and throughput.

  • Decouples communication learning from policy optimization to reduce interference.
  • Uses contrastive learning to enforce consistency across agents and time.

Unlocking Fine-Grained and Within-Utterance Speaking Style Control in Prompt-Based Text-to-Speech Models

This paper proposes novel techniques for inter-utterance style interpolation and intra-utterance style transition in prompt-based TTS models, addressing limitations of coarse global control. Methods include direction vector interpolation and KV-cache swapping with sliding-window attention masking. Experiments show high success rates in gender conversion and smooth style transitions within utterances.

  • Inter-utterance interpolation via direction vectors between contrastive style prompts enables smooth transitions.
  • Intra-utterance transition uses KV-cache swapping and sliding-window masking to overcome attention bias.

$E^3$-Agent: An Executable and Evolving Agent for Resource Management of Edge Generative Inference

This paper presents $E^3$-Agent, an executable and evolving agent for resource management of edge AIGC. It separates a fast-path router from a slow-path LLM meta-controller, learns online from execution feedback, and adapts to unknown time-varying service-time mappings. Evaluation shows 65%-73% latency reduction over static baselines and effective stutter suppression.

  • Edge generative inference faces unknown per-device performance and non-stationarity.
  • $E^3$-Agent uses a dual-path architecture: fast router + slow LLM meta-controller.

Soro: A Lightweight Foundation Model and Chatbot for Tajik

Soro is a family of Tajik-specialized conversational LLMs built on Gemma 3, using 1.9B token Tajik continual pretraining and 40K instruction tuning examples. It substantially outperforms same-size Gemma 3 on Tajik benchmarks while retaining English performance. FP8/INT4 quantization preserves gains for edge deployment. An education pilot is underway in Tajikistan.

  • Based on Gemma 3, with 1.9B token Tajik continual pretraining and 40K instruction tuning examples.
  • Substantially outperforms same-size Gemma 3 on Tajik benchmarks, retains English performance.

Reliable LLM Inference at Scale

At Databricks, we’ve built a unique inference platform that serves every frontier model, from open source to proprietary, powering some of the largest agentic applications. Serving over 120T tokens per month, we tackle challenges of reliability and latency through abstractions like model units for capacity management, cost-aware load balancing and autoscaling that save over 80% GPU costs, and runtime reliability mechanisms including black-box health checks that detect silent failures. Profiling multimodal bottlenecks unlocked 3x throughput gains.

  • Databricks' inference platform serves frontier models including open source and proprietary, handling 120T tokens/month.
  • Model units provide a VM-like abstraction for capacity management, enabling cost-aware routing and scaling.

Meet EAGLE 3.1: The Speculative Decoding Algorithm That Fixes Attention Drift in LLM Inference

The EAGLE team, vLLM team, and TorchSpec team have jointly released EAGLE 3.1 to fix speculative decoding instability in production LLM serving. The algorithm addresses attention drift through two architectural improvements: FC normalization and post-norm hidden-state feedback. Benchmarks show up to 2× longer acceptance length in long-context tasks and 2.03× per-user throughput on Kimi K2.6 at concurrency 1. EAGLE 3.1 is backward-compatible with EAGLE 3 checkpoints and has been merged into vLLM main, shipping in v0.22.0.

  • EAGLE 3.1 fixes attention drift, where the draft model gradually shifts focus from context tokens to its own generated tokens during deep speculation.
  • Two architectural fixes: FC normalization to stabilize hidden states, and feeding normalized states back to the next step.
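
"Acceptance length" in the summary refers to how many drafted tokens the target model accepts per verification step. The sketch below shows generic greedy verification in speculative decoding using plain token-ID lists. It does not model EAGLE 3.1's draft network, FC normalization, or hidden-state feedback, and the function names are illustrative.

```python
def accepted_prefix(draft_tokens: list[int], target_tokens: list[int]) -> int:
    """Count leading draft tokens that match the target model's greedy choices."""
    n = 0
    for drafted, verified in zip(draft_tokens, target_tokens):
        if drafted != verified:
            break
        n += 1
    return n


def speculative_step(draft_tokens: list[int], target_tokens: list[int]) -> list[int]:
    """Tokens emitted by one verification step.

    target_tokens[i] is the target's choice given the context plus
    draft_tokens[:i], so it has len(draft_tokens) + 1 entries. The step keeps
    the accepted draft prefix and appends the target's token at the first
    mismatch (or its bonus token if every draft token was accepted).
    """
    n = accepted_prefix(draft_tokens, target_tokens)
    return draft_tokens[:n] + [target_tokens[n]]


print(speculative_step([5, 9, 2, 7], [5, 9, 4, 1, 3]))  # [5, 9, 4]
```

A draft model that drifts away from the context mismatches earlier, so fewer tokens are emitted per target forward pass; a longer acceptance length translates directly into higher per-user throughput.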

Design a High-Precision Retrieve-and-Rerank Pipeline with ZeroEntropy Zerank-2 Reranker

This tutorial demonstrates how to use zeroentropy/zerank-2-reranker, a 4B Qwen3-based cross-encoder reranker, to enhance retrieval quality. It covers environment setup, pairwise scoring, model.rank usage, a two-stage retrieve-and-rerank pipeline, NDCG@10 evaluation, cross-domain testing in finance, legal, and code, and batched throughput measurement.

  • zerank-2 reranker improves retrieval precision beyond simple embedding similarity.
  • A two-stage pipeline (bi-encoder retrieval + cross-encoder reranking) optimizes search quality.
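
The tutorial evaluates with NDCG@10. As a reference for that metric, here is a minimal implementation using linear gain, with the ideal ranking computed from the retrieved list itself (so it assumes every relevant document was retrieved). The relevance labels are made up; this is not the tutorial's evaluation code.

```python
import math


def dcg_at_k(relevances: list[float], k: int) -> float:
    """Discounted cumulative gain over the top k results (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: list[float], k: int = 10) -> float:
    """DCG normalized by the DCG of the best possible ordering of the same list."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical relevance labels in ranked order, before and after reranking.
first_stage = [0, 1, 0, 1, 1]
reranked = [1, 1, 1, 0, 0]
print(round(ndcg_at_k(first_stage), 4), ndcg_at_k(reranked))
```

In this toy case the first-stage ordering scores about 0.68 and the reranked ordering scores 1.0, which is the kind of gap a cross-encoder reranker is meant to close.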

RED: Adaptive Real-Time DAG Scheduling for Robotic Inference under Environmental Dynamics

RED is a real-time scheduling framework for multi-task deep neural network workloads on resource-constrained robotic platforms. It adapts to runtime environmental changes by assigning intermediate sub-deadlines, leveraging MIMONet weight sharing, and reconstructing computation graphs. Implemented on NVIDIA Jetson and Apple M-series platforms, RED consistently outperforms existing methods in throughput, deadline satisfaction, robustness, adaptability, and overhead.

  • RED assigns intermediate sub-deadlines to accommodate evolving computation graphs and asynchronous inference.
  • It leverages MIMONet's shared parameters to improve schedulability through workload refinement and graph reconstruction.

ActQuant: Sub-4-bit Action-Guided Quantization for Vision-Language-Action Models

ActQuant is an action-guided mixed-precision post-training quantization framework for Vision-Language-Action (VLA) models, enabling sub-4-bit weight quantization through a two-stage approach that maintains high success rates on the LIBERO benchmark and a real UR3 robotic arm, significantly reducing memory footprint.

  • ActQuant employs action-aware mixed-precision quantization to preserve VLA model performance under sub-4-bit weight quantization.
  • The two-stage framework includes an inter-tensor bit allocator and an intra-tensor scale optimizer focusing on action-critical weights.

AERIC: Anticipatory Hidden-State Monitoring for Implicit Harmful Dialogue

AERIC is a lightweight safety monitor that reads hidden states during decoding to anticipate implicit harmful content without extra forward passes. With only 387 trainable parameters, it outperforms larger models on multiple benchmarks and increases latency by just 2.34%.

  • AERIC predicts harmful content early by analyzing the model's internal hidden states.
  • Combines short-horizon hazard forecasting, support-sensitive suppression, and prompt-conditioned residual scoring.

Toward Reliable Design of LLM-Enabled Agentic Workflows: Optimizing Latency-Reliability-Cost Tradeoffs

This paper analyzes the fundamental tradeoffs among latency, reliability, and cost in LLM-enabled agentic workflows. It introduces performance models using a parametric exponential reliability function for LLM agents and proposes a water-filling token allocation policy under latency and cost constraints.

  • LLM agentic workflows involve tradeoffs among latency, reliability, and cost.
  • A parametric exponential reliability function models LLM agent performance.

Together AI Open-Sources OSCAR: An Attention-Aware 2-Bit KV Cache Quantization System for Long-Context LLM Serving

Together AI has released OSCAR (Offline Spectral Covariance-Aware Rotation), an INT2 KV cache quantization method for long-context LLM serving. Unlike prior rotation-based approaches that apply data-oblivious Hadamard transforms, OSCAR derives separate rotations for keys and values from attention-aware covariance structures estimated offline. At 2.28 bits per KV element, OSCAR reduces the BF16 accuracy gap to 3.78 points on Qwen3-4B-Thinking-2507 and 1.42 points on Qwen3-8B, while delivering approximately 8× KV memory reduction and up to 3× decode speedup at 100K context length.

  • OSCAR is a 2-bit KV cache quantization method using attention-aware rotations that maintain near-BF16 accuracy.
  • It derives rotations from query and value covariances via offline calibration, directing quantization noise to attention-insensitive directions.

DeepSeek V4 Gets Even Cheaper: New Tool Boasts 99.82% Cache Hit Rate, Slashes Bills to 20%

One month after DeepSeek V4's release, the open-source community unveiled Reasonix, a tool specifically designed to minimize API costs by maximizing cache efficiency. It achieves a staggering 99.82% cache hit rate, reducing a $61 bill for 400M+ tokens to just $12.

  • Reasonix is a dedicated coding harness for DeepSeek, focusing on cost reduction.
  • Its cache-first loop, tool-call repair, and automatic context compression maintain over 90% cache hit rate in long sessions.
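
The headline's "20%" can be checked against the dollar figures in the summary, and the effect of cache hit rate on input cost can be sketched with a blended price. The per-million-token prices below are hypothetical placeholders, not DeepSeek's actual rates.

```python
def blended_cost(tokens: float, hit_rate: float,
                 price_hit: float, price_miss: float) -> float:
    """Input cost in dollars; prices are per million tokens."""
    per_million = hit_rate * price_hit + (1.0 - hit_rate) * price_miss
    return tokens / 1e6 * per_million


# Figures from the summary: a $61 bill reduced to $12.
print(round(12 / 61 * 100, 1))  # 19.7 (percent of the original bill)

# Hypothetical prices: cached input at a tenth of the uncached price.
for rate in (0.0, 0.90, 0.9982):
    cost = blended_cost(400e6, rate, price_hit=0.1, price_miss=1.0)
    print(f"hit rate {rate:.2%}: ${cost:.2f}")
```

When cached tokens are much cheaper than uncached ones, the remaining misses dominate the bill, so gains between 90% and 99.82% hit rate still matter.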
