Inference Cost AI News

Inference Cost updates

Closing the Loop: Training-Free Revisit Consistency for Autoregressive Generative Rendering

2026-07-27 04:00 UTC

Recent conditional video generation models can convert 3D engine renderings like depth maps into photorealistic videos, but suffer from appearance inconsistencies when the camera revisits a location after context eviction in long-horizon autoregressive generation. This paper proposes a training-free method that leverages correspondences from the 3D engine: temporal correspondence retrieves pose-matched historical latent chunks into the KV cache as loop-closure memory, while spatial correspondence biases token-level attention toward geometrically corresponding regions. Evaluations on loop-closure trajectories from TartanAir and TartanGround show improved revisit consistency without sacrificing video quality, outperforming existing training-free baselines.

Autoregressive video generation models produce inconsistent appearances when the camera revisits a location, because earlier context has been evicted.
Training-free method that exploits 3D engine correspondences: temporal retrieval and spatial attention biasing.

Black Forest Labs Releases FLUX 3: A Multimodal Flow Model for Image, Video, Audio and Robot Action Prediction

2026-07-26 17:50 UTC

Black Forest Labs (BFL) releases FLUX 3, a multimodal foundation model that learns from images, videos, and audio within a single architecture. It is the first FLUX model to output video, audio, and action predictions from one set of weights. The model builds on the Self-Flow method and excels in video generation, producing clips up to 20 seconds with native audio. In human preference tests, FLUX 3 outperforms many competitors. The same backbone also drives the FLUX-mimic robot policy with sub-80 ms latency.

FLUX 3 is a multimodal foundation model unifying images, videos, and audio. It can generate up to 20-second video clips with native audio, leading in human preference evaluations.
Training uses the Self-Flow method; video consumes over 95% of compute, audio less than 0.5% of tokens.

Datalab’s Marker 2 vs MinerU, Docling and LiteParse: 76.0 on olmOCR-bench at 5× MinerU’s Throughput

2026-07-25 02:14 UTC

Datalab released Marker 2, a full rewrite of its open-source document conversion pipeline. It scores 76.0% on olmOCR-bench in balanced mode, sustaining 2.9 pages per second on a single B200 GPU—over 5× MinerU's pipeline throughput—while outperforming Docling on both accuracy and speed. The article compares Marker 2 with MinerU, Docling, and LiteParse across performance, licensing, and use cases.

Marker 2 balanced scores 76.0% on olmOCR-bench at 2.9 pg/s, over 5× MinerU's pipeline throughput.
It beats Docling on both accuracy (76.0% vs 50.3%) and speed (2.9 vs 2.1 pg/s).

Claude Opus 5 arrives with near Fable performance at half the price

2026-07-24 17:00 UTC

Anthropic's latest Claude upgrade targets developers and enterprises with stronger coding, better reasoning efficiency, prompt-cache-friendly tool changes, and near-Fable performance at Opus pricing.

Near Fable 5 performance at half the price.
Uses 26% fewer tokens on average compared to Opus 4.8.

Break Through the Compression Bottleneck: From Theory to Practice

2026-07-24 04:00 UTC

As language model parameter counts grow, effective compression is essential to reduce computational and memory overhead. Existing methods suffer from performance degradation at high compression ratios. This paper provides the first mathematical proof that low-rank decomposition and quantization are non-orthogonal: combining them introduces additional errors and causes significant performance loss. The authors propose the Diagonal Adhesive Method (DAM) to combine both techniques effectively and mitigate the loss.

First mathematical proof that low-rank decomposition and quantization are non-orthogonal, introducing additional errors when combined.
Experiments on large language models confirm performance degradation from combining the methods.

DecodeShare: Tracing the Shared Subspace of LLM Decode-Time Decisions

2026-07-24 04:00 UTC

Large language models (LLMs) handle many tasks with one set of parameters, but under KV-cached inference it is unclear what task-general structure, if any, is used at decode time rather than during prefill. We propose DecodeShare, a protocol that identifies a low-dimensional subspace consistently shared across tasks in decode-time hidden states, and then tests its causal role by removing that subspace only during decoding. In our experiments, disturbing the discovered shared subspace degrades decision performance far more than disturbing either a prefill-derived or random subspace under the same intervention budget. We further show this decode-shared subspace has practical consequences for activation steering: common steering directions can overlap the task-general decode channel. Projecting out this shared subspace directly separates the functional roles of the two components, while evaluating steering vectors at decode-time yields more reliable signal for downstream deployment than prefill-based proxies. Despite its compactness, the shared subspace can serve as a high-leverage causal channel at decode time.

Proposes DecodeShare protocol to identify a shared low-dimensional subspace in LLM decode-time hidden states.
Disturbing this shared subspace degrades task performance more than disturbing prefill or random subspaces.
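Removing a subspace from hidden states, as the intervention above describes, amounts to an orthogonal projection. The following is a minimal sketch in plain Python, assuming an orthonormal basis for the shared subspace is already available. The function name `project_out` and the toy vectors are illustrative and do not come from the paper, whose subspace-discovery procedure is not reproduced here.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def project_out(hidden: Vector, basis: List[Vector]) -> Vector:
    """Remove the component of `hidden` that lies in span(basis).

    Assumes the vectors in `basis` are orthonormal (hypothetical input).
    """
    result = list(hidden)
    for b in basis:
        coeff = dot(hidden, b)  # coordinate of `hidden` along this basis vector
        result = [r - coeff * bi for r, bi in zip(result, b)]
    return result


# Toy example: a 3-d hidden state and a 1-d "shared" subspace along the first axis.
shared_basis = [[1.0, 0.0, 0.0]]
h = [2.0, -1.0, 0.5]
h_ablated = project_out(h, shared_basis)
```

After the projection, `h_ablated` has no component left along the shared direction, while the remaining coordinates are unchanged.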

InferenceBench: A Benchmark for Open-Ended LLM Inference Optimization by AI Agents

2026-07-24 04:00 UTC

InferenceBench evaluates AI agents on open-ended LLM inference optimization. Agents must deploy an inference server and optimize speed under a two-hour budget across four scenarios: prefill latency, decode latency, concurrent throughput, and a balanced mix. Agents improve up to 8.08x over a naive PyTorch baseline but fall short of simple hyperparameter search (11.53x). Analysis shows that agents converge on a single framework and explore few alternatives.

InferenceBench tasks agents with optimizing LLM inference speed in a realistic server setup across four scenarios.
Frontier agents achieve up to 8.08x speedup over a naive PyTorch baseline, but underperform simple hyperparameter search (11.53x).

DC-Leap: Training-Free Acceleration of dLLMs via Draft-Guided Contiguous Leaping Decoding

2026-07-24 04:00 UTC

DC-Leap is a training-free framework that accelerates diffusion large language models (dLLMs) by addressing the Joint Probability Dependence Error (JPDE) that leads to overly conservative confidence thresholds. It introduces Dynamic Contiguous Verification and draft-guided decoding to achieve up to 53.19x speedup on MBPP long-sequence generation and up to 105.02x when combined with KV-Cache, while maintaining generation quality.

Overcomes conservative thresholds caused by Joint Probability Dependence Error (JPDE) in dLLM parallel decoding.
Introduces Dynamic Contiguous Verification to integrate causal constraints and neutralize JPDE.

“We love the world where we can use both”: How NVIDIA thinks about local and frontier models

2026-07-23 18:12 UTC

NVIDIA's senior director of generative AI software, Joey Conway, discusses how local open models are increasingly working alongside frontier models, with routers deciding which to use, enabling organizations to achieve better outcomes at lower cost and latency.

NVIDIA advocates combining local and frontier models via intelligent routing to match task complexity.
Hardware like DGX Spark allows running up to 200B-parameter models locally, offering full data control.

VQ-Transplant: Efficient VQ-Module Integration for Pre-trained Visual Tokenizers

2026-07-23 04:00 UTC

VQ-Transplant introduces a lightweight framework for plug-and-play integration of new vector quantization modules into frozen pre-trained tokenizers without costly end-to-end retraining. A lightweight decoder adaptation trained for only 5 epochs on ImageNet-1k mitigates quantization mismatch, achieving near state-of-the-art reconstruction fidelity on industry-level models like VAR while reducing training cost by 95%. This democratizes quantization research, enabling resource-efficient exploration of novel VQ techniques.

Plug-and-play VQ module replacement without retraining encoder-decoder.
Lightweight decoder adaptation with only 5 epochs on ImageNet-1k.

NEXUS: Structured Runtime Safety for Tool-Using LLM Agents

2026-07-23 04:00 UTC

NEXUS is a structured-plan safety monitor that combines deterministic safety rules, argument-level inspection, and a calibrated logistic-regression risk score to allow, block, request confirmation, or request revision for LLM agent actions. It achieves strong benchmark results with minimal latency.

NEXUS uses four intervention actions for fine-grained safety control.
It outperforms rule-only methods by combining rules with a learned risk score.
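The combination of deterministic rules, argument inspection, and a logistic risk score can be sketched as a small decision function. This is a rough illustration only: the blocked tool list, the unsafe argument pattern, the feature names, the weights, and the two thresholds are all hypothetical and are not taken from NEXUS.

```python
import math
from dataclasses import dataclass
from typing import Dict


@dataclass
class Action:
    tool: str
    args: Dict[str, str]


# Hypothetical deterministic rule: tools that are never allowed.
BLOCKED_TOOLS = {"format_disk"}
# Hypothetical argument-level check: substrings treated as unsafe.
UNSAFE_ARG_PATTERNS = ("rm -rf /",)
# Hypothetical logistic-regression parameters over two hand-made features.
WEIGHTS = {"writes_data": 1.5, "external_network": 2.0}
BIAS = -3.0


def risk_score(features: Dict[str, float]) -> float:
    """Logistic risk score in (0, 1)."""
    z = BIAS + sum(WEIGHTS[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


def decide(action: Action, features: Dict[str, float],
           confirm_at: float = 0.5, revise_at: float = 0.2) -> str:
    """Return one of: allow, block, request_confirmation, request_revision."""
    if action.tool in BLOCKED_TOOLS:
        return "block"  # deterministic rule fires first
    if any(p in value for value in action.args.values() for p in UNSAFE_ARG_PATTERNS):
        return "block"  # argument-level inspection
    score = risk_score(features)
    if score >= confirm_at:
        return "request_confirmation"
    if score >= revise_at:
        return "request_revision"
    return "allow"


safe = decide(Action("read_file", {"path": "notes.txt"}),
              {"writes_data": 0, "external_network": 0})
risky = decide(Action("http_post", {"url": "https://example.com"}),
               {"writes_data": 1, "external_network": 1})
```

Checking rules before the learned score keeps hard constraints independent of model calibration; how NEXUS actually orders or maps these signals to actions may differ.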

Benchmarking Confidential GPU Inference on NVIDIA H100 under Intel TDX

2026-07-23 04:00 UTC

A new study benchmarks the performance cost of enabling confidential computing for LLM inference on an NVIDIA H100 GPU under Intel TDX. Using Mistral-7B and Qwen3-30B-A3B models, results show a 21.8%-27.8% increase in time-to-first-token and 17.7%-21.1% drop in global token throughput in confidential mode. The larger model reaches saturation earlier, highlighting the need for capacity planning adjustments.

Confidential computing is becoming a practical requirement for AI inference but introduces performance overhead.
The study tests two LLMs on an H100 GPU within an Intel TDX confidential instance.
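The capacity-planning implication can be made concrete with simple arithmetic. If per-GPU throughput drops by a fraction d, serving the same load needs roughly 1 / (1 - d) times the capacity. The sketch below applies this to the reported 17.7% to 21.1% throughput drop; the assumption that capacity scales linearly with GPU count is ours, not the study's.

```python
def capacity_multiplier(throughput_drop: float) -> float:
    """Extra capacity factor needed to offset a fractional throughput drop.

    Assumes throughput scales linearly with the number of GPUs.
    """
    if not 0.0 <= throughput_drop < 1.0:
        raise ValueError("throughput_drop must be in [0, 1)")
    return 1.0 / (1.0 - throughput_drop)


low = capacity_multiplier(0.177)   # reported lower bound of the drop
high = capacity_multiplier(0.211)  # reported upper bound of the drop
print(f"{low:.3f}x to {high:.3f}x the non-confidential capacity")
```

Under these assumptions, a fleet sized for non-confidential serving would need roughly 22% to 27% more GPUs to hold throughput constant.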

Intelligent Multi-UAV Navigation in ITNTNs: A Hierarchical LLM Approach

2026-07-22 04:00 UTC

The paper proposes a hierarchical framework that combines cloud-based and edge LLMs with deep reinforcement learning (DRL) for UAV navigation in ITNTNs, reducing collisions and improving throughput.

Cloud-based LLM on HAPS handles global load balancing
Edge-LLMs on UAVs translate local observations to tactical sub-goals

Recti-Q: Feature-Space Rectification for Out-of-Distribution-Robust Quantized Perception in Edge Robotics

2026-07-22 04:00 UTC

The paper identifies a robustness gap introduced by post-training quantization (PTQ) in robotic perception models deployed on edge devices. While PTQ maintains in-distribution accuracy, it reduces reliability under distribution shifts. The authors propose Recti-Q, a lightweight feature-space rectification method that uses a frozen quantized backbone and a small LoRA adapter, achieving significant robustness recovery with minimal overhead.

PTQ degrades robustness under distribution shifts despite preserving in-distribution accuracy.
Recti-Q freezes quantized backbone and trains a small LoRA adapter with only source data.

Surprise Forcing: What to Remember, When to Skip in Long Video Generation

2026-07-22 04:00 UTC

Surprise Forcing is a training-free framework that improves long video generation by addressing two limitations of streaming autoregressive diffusion: bounded context and fixed denoising schedule. It uses a Surprise-Gated Memory Bank to selectively retain important visual evidence and Surprise-Aware Denoising to skip denoising steps for easy chunks. Experiments show improved consistency and quality while maintaining real-time throughput.

Streaming autoregressive diffusion suffers from bounded context and fixed denoising schedule, leading to uniform resource allocation and forgetting of distant visual evidence.
Surprise Forcing treats these limitations as online resource-allocation problems and requires no additional training.

Beyond Accuracy and Cost: Latency-Aware LLM Query Routing for Dynamic Workloads

2026-07-22 04:00 UTC

Modern LLM query routers often ignore generation latency, focusing only on accuracy and cost. This paper introduces a lightweight latency estimator that simulates autoregressive token batch processing to predict time-to-first-token (TTFT), and integrates it into a router that jointly optimizes latency, accuracy, and cost. Experiments show up to 40% improvement in accuracy-cost utility while maintaining the same latency as standard load-balancing approaches.

Current query routers are latency-agnostic and leave latency to load-balancing policies, which in turn ignore accuracy and cost.
The proposed lightweight latency estimator simulates batch processing in serving frameworks to estimate TTFT.
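A router that weighs latency, accuracy, and cost together might look like the sketch below. It is a deliberately crude stand-in: the paper's estimator simulates batch processing in the serving framework, whereas this version approximates TTFT as queued prompt tokens divided by prefill throughput. The `Backend` fields, the utility formula, and all numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Backend:
    name: str
    accuracy: float              # expected answer quality in [0, 1]
    cost_per_query: float        # arbitrary cost units
    queued_prompt_tokens: int    # prompt tokens waiting ahead of this request
    prefill_tokens_per_s: float  # prefill throughput


def estimate_ttft(backend: Backend, prompt_tokens: int) -> float:
    """Crude TTFT estimate: time to prefill the queue plus this prompt."""
    return (backend.queued_prompt_tokens + prompt_tokens) / backend.prefill_tokens_per_s


def route(backends: List[Backend], prompt_tokens: int,
          ttft_slo_s: float, cost_weight: float = 1.0) -> Optional[Backend]:
    """Pick the best accuracy-cost utility among backends that meet the TTFT SLO."""
    feasible = [b for b in backends if estimate_ttft(b, prompt_tokens) <= ttft_slo_s]
    if not feasible:
        return None
    return max(feasible, key=lambda b: b.accuracy - cost_weight * b.cost_per_query)


backends = [
    Backend("large", accuracy=0.90, cost_per_query=0.05,
            queued_prompt_tokens=40_000, prefill_tokens_per_s=10_000.0),
    Backend("small", accuracy=0.80, cost_per_query=0.01,
            queued_prompt_tokens=1_000, prefill_tokens_per_s=20_000.0),
]
tight = route(backends, prompt_tokens=2_000, ttft_slo_s=1.0)
loose = route(backends, prompt_tokens=2_000, ttft_slo_s=5.0)
```

With a tight latency target the busy large backend is excluded and the small one wins; with a looser target the large backend's higher utility takes over.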

Hardware Mechanisms to Dynamically Throttle AI Performance

2026-07-22 01:01 UTC

As AI models integrate into critical systems, existing software safeguards may be bypassed. Researchers propose microarchitecture knobs that dynamically control GPU memory subsystem resources (L2 cache size, latency, bandwidth, shared memory port access rate) to limit AI performance at runtime, achieving up to 80% performance reduction with negligible cost.

Software safeguards can potentially be bypassed by sufficiently intelligent AI; hardware-level safety is essential.
Four microarchitecture knobs proposed: L2 size, L2 latency, L2 bandwidth, and shared memory port access rate.

Agent swarms are great for local AI

2026-07-22 00:43 UTC

This article examines the poor tokenomics of local AI development, where running a single agent on expensive hardware yields low throughput. It introduces agent swarms—parallel task execution across many agents—as a game-changer. By saturating GPUs with parallel workloads, local hardware becomes cost-effective compared to API calls. Detailed calculations show that a 32-agent swarm on a local rig costs only a fraction of API-based alternatives, making local AI worthwhile for the first time.

Single-agent local AI has high hardware cost and low token throughput.
Agent swarms distribute tasks across many parallel agents, drastically improving GPU utilization.

Validating Distributed LLM Serving Benchmarks with NVIDIA srt-slurm, SLURM Recipes, Parameter Sweeps, and Pareto Analysis

2026-07-21 16:29 UTC

This tutorial explores NVIDIA's srt-slurm framework and shows how to use srtctl to convert declarative YAML configurations into reproducible SLURM benchmark workflows for distributed LLM serving. We set up the project in Google Colab, inspect its internal architecture, define a cluster configuration, dry-run built-in and custom recipes, and model a disaggregated prefill-and-decode deployment for DeepSeek-R1. We also generate parameter sweeps, interact with the typed Python API, validate expanded configurations, and analyze simulated benchmark results through a throughput-versus-latency Pareto frontier.

srtctl converts YAML configs into SLURM benchmark workflows
Supports disaggregated prefill and decode deployments

Google’s Gemini 3.6 Flash targets enterprise agent token costs

2026-07-21 16:06 UTC

Google has released Gemini 3.6 Flash and 3.5 Flash-Lite as new workhorses designed to cut latency and token costs for enterprise AI agents. The new models offer significant performance improvements, targeted pricing, and integrated computer-use tools, with enterprise partners already deploying them in production.

Gemini 3.6 Flash reduces output tokens by 17% (up to 65% in specific tests), priced at $1.50/1M input and $7.50/1M output tokens.
Gemini 3.5 Flash-Lite offers high throughput at lower cost ($0.30/1M input, $2.50/1M output), suitable for high-volume agentic tasks.
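The listed per-million-token prices translate into per-request cost as follows. This is a small sketch using only the prices quoted above; the dictionary keys are labels chosen for illustration rather than official API model identifiers, and the 10,000-input / 2,000-output workload is hypothetical.

```python
# (input price, output price) in USD per 1M tokens, as quoted above.
PRICES_PER_1M = {
    "gemini-3.6-flash": (1.50, 7.50),
    "gemini-3.5-flash-lite": (0.30, 2.50),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request at the listed per-1M-token prices."""
    input_price, output_price = PRICES_PER_1M[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical agent step: 10,000 input tokens and 2,000 output tokens.
flash_cost = request_cost("gemini-3.6-flash", 10_000, 2_000)
lite_cost = request_cost("gemini-3.5-flash-lite", 10_000, 2_000)
```

For this workload the two prices come to about $0.03 and $0.008 per request, which is why output-token reductions matter most on the more expensive tier.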

NVIDIA Vera Rubin Driving Performance Per Watt, Lowest Token Cost for Partners Worldwide

2026-07-21 15:36 UTC

NVIDIA Vera Rubin NVL72 production is ramping up with partners CoreWeave, Google Cloud, Microsoft Azure and Oracle Cloud Infrastructure. According to NVIDIA, the platform delivers the highest performance per watt and the lowest token cost, with 10x more throughput per megawatt than Grace Blackwell NVL72 in benchmarks. It also powers Europe's open-model era through a partnership between Microsoft and Mistral.

Vera Rubin NVL72 production ramping with 350+ factory sites in 30 countries
10x more tokens per megawatt and 1/10th cost per million tokens vs. previous gen

Google ships 3 new Gemini models. Just not the one everyone’s waiting for.

2026-07-21 15:00 UTC

Google released Gemini 3.6 Flash, a cheaper and faster 3.5 Flash-Lite, and 3.5 Flash Cyber, but the flagship 3.5 Pro remains delayed. 3.6 Flash shows significant improvements in benchmarks and lower output costs. 3.5 Flash-Lite targets high-throughput tasks with strong cost-performance. 3.5 Flash Cyber, for cybersecurity, matches Opus 4.6 but is limited to pilot access.

Google launched three new Gemini models: 3.6 Flash, 3.5 Flash-Lite, and 3.5 Flash Cyber, but the flagship 3.5 Pro is delayed.
3.6 Flash shows major gains in coding and ML benchmarks, with reduced output pricing.

Nativ: Run AI models locally on your Mac

2026-07-21 14:22 UTC

Prince Canuma, creator of MLX-VLM, launches Nativ, a macOS desktop app that wraps MLX with a chat interface and local API server, automatically detecting models in your Hugging Face cache.

Nativ is a macOS desktop app for running AI models locally.
It provides a chat interface and a localhost API server, similar to LM Studio.

Fully-sensorized smart-eyewear platform for on-device Machine Learning

2026-07-21 04:00 UTC

This paper presents ARGO, a smart eyewear platform leveraging the STM32N6 microcontroller and its integrated NPU for on-device machine learning, minimizing latency and preserving privacy. Through holistic co-design of hardware, firmware, and AI, an optimized YOLOv11 model is deployed for real-time urban obstacle recognition, introducing Head-wise Parallel Attention (HPA) for efficient NPU execution. The model achieves mAP50-95 of 24 with only 2.483 MB memory footprint. The platform integrates multimodal sensors, runs at 10 FPS, and provides ~113 minutes of autonomy on a 200 mAh battery, demonstrating the feasibility of high-performance, privacy-preserving assistive devices.

ARGO smart eyewear uses STM32N6 MCU and NPU for on-device ML, avoiding cloud dependency
Head-wise Parallel Attention (HPA) optimizes YOLOv11 model for NPU, achieving mAP50-95 24 under tight memory

Deterministic Replay for AI Agent Systems

2026-07-21 04:00 UTC

arXiv:2607.16200 presents agrepl, a CLI framework for deterministic replay of AI agent executions. Using a MITM proxy, it records external interactions and replays them in isolation, achieving perfect fidelity (F=1.0) and 98.3% latency reduction.

AI agent systems are inherently non-deterministic due to LLM variance and external state. Existing tools can't reproduce runs in isolation.
agrepl intercepts all external interactions via MITM proxy and replays them in a sandbox.

Kimi K3 open-weight model: China’s biggest AI is a bet on memory, not compute

2026-07-20 09:00 UTC

Moonshot AI’s Kimi K3, a 2.8-trillion-parameter open-weight model, uses mixture-of-experts, quantization, and attention caching to trade compute for memory, circumventing US chip restrictions. While it tops benchmarks in coding, deployment requires data-center infrastructure, pricing is high, and software support is incomplete.

Kimi K3 has 2.8 trillion parameters, making it the largest open-weight model released.
It employs mixture-of-experts, quantization-aware training, and Kimi Delta Attention to reduce compute and memory demands.

KDnuggets Weekly Roundup: Week of July 13, 2026

2026-07-18 13:00 UTC

This week's highlights include stopping if-else chains with the registry pattern, 12 ways to reduce LLM latency and costs, 5 real-world SQL projects for your portfolio, Git worktrees for AI development, structured generation with Outlines, 7 Python frameworks for local AI agents, 10 YouTube channels to stay ahead in AI, getting started with Conductor for Gemini CLI, 5 free resources on agentic AI, and working with Pi coding agents.

Registry pattern replaces brittle if-else chains for extensible code
Optimize LLM inference by minimizing tokens, model routing, and caching
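The registry pattern mentioned above replaces a growing if-else chain with a lookup table that handlers add themselves to. The sketch below is a generic illustration, not code from the roundup; the handler names are made up.

```python
from typing import Callable, Dict

# Maps a handler name to the function that implements it.
HANDLERS: Dict[str, Callable[[str], str]] = {}


def register(name: str) -> Callable[[Callable[[str], str]], Callable[[str], str]]:
    """Decorator that records a function in the registry under `name`."""
    def decorator(func: Callable[[str], str]) -> Callable[[str], str]:
        HANDLERS[name] = func
        return func
    return decorator


@register("upper")
def to_upper(text: str) -> str:
    return text.upper()


@register("reverse")
def reverse(text: str) -> str:
    return text[::-1]


def dispatch(name: str, text: str) -> str:
    """Look up the handler by name instead of branching on it."""
    try:
        handler = HANDLERS[name]
    except KeyError:
        raise ValueError(f"unknown handler: {name!r}") from None
    return handler(text)
```

Adding a new behavior means writing one decorated function; `dispatch` itself never changes.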

Meta’s Muse Spark 1.1 is now available on Databricks, fully governed by Unity AI Gateway

2026-07-17 13:08 UTC

Meta's new Muse Spark 1.1 model is now available on Databricks via Model Provider Services (MPS) in Unity AI Gateway. This service allows organizations to register providers once in Unity Catalog, eliminating API key sprawl and centralizing governance through familiar permissions, rate limits, and guardrails. Additionally, every request is automatically tracked with token usage, latency, cost attribution, and audit logs for end-to-end observability.

Access Meta's new Muse Spark 1.1 model on Databricks through Model Provider Services in Unity AI Gateway.
Register providers once in Unity Catalog to centralize access, rate limits, and guardrails.

NVIDIA AI Releases Nemotron 3 Embed: An Open Embedding Collection Whose 8B Checkpoint Ranks #1 on RTEB

2026-07-17 07:53 UTC

NVIDIA released Nemotron 3 Embed on July 15 and 16, 2026. The collection has three open checkpoints: Nemotron-3-Embed-8B-BF16, Nemotron-3-Embed-1B-BF16, and Nemotron-3-Embed-1B-NVFP4. The 8B ranks #1 on RTEB at 78.46 average NDCG@10. The 1B came from ModelOpt NAS pruning plus COS+MSE distillation from the 8B teacher. NVFP4 retains 99%+ of BF16 retrieval accuracy at up to 2x Blackwell throughput. All three run 32,768-token inputs under OpenMDW-1.1.

Nemotron-3-Embed-8B-BF16 ranks #1 on RTEB with 78.46 average NDCG@10
Three open checkpoints: 8B BF16, 1B BF16, and 1B NVFP4

Polestar: Drift-Aware Cache Calibration and Token Commitment for Efficient Inference of Diffusion LLMs

2026-07-17 04:00 UTC

Polestar is a training-free inference framework that addresses KV-cache reuse and decoding parallelism challenges in diffusion LLMs by leveraging token representation drift. It consists of Polestar-Cache for sparse cache refreshes and Polestar-Commit for identifying commit-ready tokens, achieving up to 10.73% accuracy improvement and 3.7x higher throughput on math and coding benchmarks.

Polestar uses token representation drift to jointly optimize cache efficiency and decoding parallelism.
Polestar-Cache identifies stale KV-cache positions for sparse refreshes, enabling efficient reuse.

A 3DGS-Driven Dynamic Viewpoint and Vibrotactile Framework for Subsea Teleoperation Validated via fNIRS

2026-07-16 04:00 UTC

A multimodal teleoperation architecture for ROVs using 3D Gaussian Splatting to generate occlusion-free exocentric views and a vibrotactile suit for haptic cues. A human study with 30 participants showed the exocentric view significantly improves performance under high latency, with fNIRS indicating sustained executive control rather than cognitive overload.

DAVS uses real-time 3D Gaussian Splatting to create an occlusion-free exocentric viewpoint
Vibrotactile suit maps obstacle clearance to intuitive haptic cues, reducing sensory workload

Transforming LLMs into Efficient Cross-Encoders via Knowledge Distillation for RAG Reranking

2026-07-15 04:00 UTC

This work fine-tunes LLaMA 3 (8B) as an efficient drop-in reranker via supervised fine-tuning and 4-bit quantization, replacing cross-encoders in RAG pipelines. It achieves 14-21% improvement in answer relevancy, context precision, answer similarity, and answer correctness on a domain-specific QA benchmark while reducing inference overhead.

Traditional cross-encoders have quadratic inference costs limiting real-time RAG deployment.
Two-stage pipeline: supervised fine-tuning with LoRA on a custom dataset, then 4-bit quantization.

Semidirect Fourier Delta Attention: Phase-Controlled Delta Memory with Constructive Chunk-WY Kernels

2026-07-15 04:00 UTC

Linear attention replaces softmax attention's growing KV cache with a fixed recurrent state, but this compression limits exact state tracking and long-context memory. This paper introduces Semidirect Fourier Delta Attention (SFDA), a phase-controlled generalization of Kimi Delta Attention that replaces real diagonal decay with block-rotational Fourier control. The main result is a constructive chunk-WY factorization, enabling exact affine chunk transfer, formal stability and complexity bounds, and a compact characterization of phase-plus-low-rank memory. Experiments show SFDA learns cyclic memory while the phase-disabled KDA baseline remains near chance.

SFDA improves linear attention via phase-controlled Fourier memory, addressing limitations in state tracking and long-context memory.
A constructive chunk-WY factorization is proposed, bounding rank growth within fixed chunks for efficient transfer and stability.

OS -> Prod Survey

2026-07-14 18:53 UTC

The State of Open Source AI report reveals that open-weight models have achieved near-parity with closed models in capability, while inference costs dropped 50x in 36 months. Open models are adopted by 79% of developers but only 51% reach production due to operational challenges. The report emphasizes open source as a sovereignty choice, with over 70 national AI strategies in place.

Open-source AI capability gap to top closed models narrowed to 3.3%, with parity in coding tasks.
GPT-4-class inference cost fell from $20 to $0.40 per 1M tokens, a 50x drop in 36 months.
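The 50x figure follows directly from the two quoted prices, and it implies an average monthly decline if one assumes a constant geometric rate over the 36 months. That constant-rate assumption is ours and is not stated in the report.

```python
old_price = 20.0   # USD per 1M tokens, as quoted
new_price = 0.40   # USD per 1M tokens, as quoted
months = 36

fold = old_price / new_price                 # overall reduction factor
monthly_factor = fold ** (1 / months)        # constant-rate assumption
monthly_decline = 1 - 1 / monthly_factor     # average price drop per month

print(f"{fold:.0f}x overall, about {monthly_decline:.1%} cheaper each month")
```

Under that assumption, prices fell by roughly a tenth every month, compounding to the 50x total.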

12 Ways to Reduce LLM Latency and Inference Costs in Production

2026-07-14 12:00 UTC

Scaling LLMs isn’t about adding GPUs. It’s about removing wasted work from every request.

Measure queue time, TTFT, inter-token latency, and cache hit rate before optimizing.
Reduce output tokens by setting realistic limits and asking for concise answers.
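The metrics named above can be derived from a handful of timestamps per request. This is a minimal sketch with hypothetical function names; it assumes TTFT is measured from the moment the request is enqueued, so it includes queue time.

```python
from statistics import mean
from typing import Dict, List


def latency_metrics(enqueued_at: float, started_at: float,
                    token_times: List[float]) -> Dict[str, float]:
    """Queue time, TTFT, and mean inter-token latency for one request.

    All arguments are timestamps in seconds on the same clock.
    """
    if not token_times:
        raise ValueError("need at least one token timestamp")
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    return {
        "queue_time_s": started_at - enqueued_at,
        "ttft_s": token_times[0] - enqueued_at,
        "mean_inter_token_s": mean(gaps) if gaps else 0.0,
    }


def cache_hit_rate(hits: int, misses: int) -> float:
    """Fraction of lookups served from cache; 0.0 when there were none."""
    total = hits + misses
    return hits / total if total else 0.0


metrics = latency_metrics(enqueued_at=0.0, started_at=0.5,
                          token_times=[1.0, 1.25, 1.5, 1.75])
hit_rate = cache_hit_rate(hits=30, misses=10)
```

Separating queue time from TTFT shows whether slow first tokens come from waiting for a slot or from prefill itself.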

Maximizing Human Efficiency in Large-Scale Robot Post-Training via VLAC-Cut Guided Pipeline

2026-07-14 04:00 UTC

This paper proposes a human-efficient post-training pipeline that enables a small number of human operators to supervise multiple robots through specialized division of labor and automatic trajectory segmentation using VLAC-CUT. Validated on four real-world manipulation tasks, the final policies achieve 80%-95% success rates and improve task throughput by 1.7x-4.2x over the base model.

Proposes a human-efficient post-training pipeline with role specialization to reduce task switching and training costs.
Introduces VLAC-CUT, an automatic trajectory segmentation tool for filtering useful rollout data.

Silent Failures in Quantized LLM Reasoning: A Taxonomy-Based Analysis of Hollow Convergence and Failure Mode Shifts

2026-07-14 04:00 UTC

A new study shows that post-training quantization can silently alter how large language models reason even when task accuracy is preserved. Using a six-category failure taxonomy, the researchers classified 30,000 chain-of-thought outputs and found that hollow convergence exhibits a size-dependent shift under NF4 quantization, while shortcut collapse and confidence snowballing undergo qualitative changes. Hollow convergence cannot be reliably detected from surface-level text features, posing a deployment risk.

Post-training quantization can silently alter LLM reasoning while preserving accuracy
Hollow convergence decreases sharply for smaller models under NF4 but remains stable for larger ones

Workload-Driven Optimization for On-Device Real-Time Subtitle Translation

2026-07-14 04:00 UTC

This report studies on-device English-to-Traditional-Chinese subtitle translation for Taiwan under short inputs, short outputs, batch-size-one inference, low latency, and privacy constraints. The authors replace the original 151k-token vocabulary with a 64k-token subtitle-domain tokenizer, perform embedding calibration and fine-tuning, achieving a 59.2% tie-excluded win rate against Google Translate on a subset of OpenSubtitles2024, and a 1.63x speedup on Apple M2.

On-device English-to-Traditional-Chinese subtitle translation optimized for short inputs, low latency, and privacy.
Replaced 151k-token vocabulary with a 64k subtitle-domain tokenizer; embedding calibration and fine-tuning applied.

MawForge: Memory-Bounded Expert Materialization for Local Mixture-of-Experts Inference

2026-07-14 04:00 UTC

A new paper introduces MawForge, a system that enables practical local inference of Sparse Mixture-of-Experts (MoE) language models on memory-constrained unified-memory machines by storing the model on disk and materializing expert tensors on demand into a bounded cache. The system is effective as a measurement substrate but not as a cache-maximization policy.

MawForge stores the full MoE model on disk and materializes routed experts into a bounded execution cache.
It is designed for local inference on constrained unified-memory machines.
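On-demand materialization into a bounded cache can be sketched with a small wrapper around a loader function. The class below is illustrative only: the least-recently-used eviction rule and the `load_from_disk` callback are assumptions, and MawForge's actual policy and storage format are not described in this summary.

```python
from collections import OrderedDict
from typing import Callable, Hashable


class BoundedExpertCache:
    """Keeps at most `capacity` experts in memory, loading others on demand."""

    def __init__(self, capacity: int, load_from_disk: Callable[[Hashable], object]):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._load = load_from_disk  # hypothetical loader for one expert's tensors
        self._cache: "OrderedDict[Hashable, object]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, expert_id: Hashable) -> object:
        if expert_id in self._cache:
            self.hits += 1
            self._cache.move_to_end(expert_id)  # mark as most recently used
            return self._cache[expert_id]
        self.misses += 1
        tensors = self._load(expert_id)
        self._cache[expert_id] = tensors
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)  # evict the least recently used expert
        return tensors


loads = []


def fake_loader(expert_id):
    """Stand-in for reading expert weights from disk."""
    loads.append(expert_id)
    return f"expert-{expert_id}"


cache = BoundedExpertCache(capacity=2, load_from_disk=fake_loader)
for routed_expert in [0, 1, 0, 2, 1]:
    cache.get(routed_expert)
```

The hit and miss counters make the cache usable for measurement, in the spirit of the paper's framing, regardless of which eviction rule is plugged in.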

Closed-Loop Control with Rule-Aligned Small Language Models and Multi-Agent Self-Correction

2026-07-14 04:00 UTC

This paper presents a closed-loop control framework using a small language model (SLM) aligned via Group Relative Policy Optimization (GRPO). The system integrates an action agent, a digital-twin validator, and a reprompting agent to iteratively correct outputs. In thermal control simulations, it achieves 91.5% action-alignment accuracy with 3.84s inference latency, demonstrating viability for edge autonomous control.

Compact 1.5B parameter SLM (Qwen2.5-1.5B) aligned via GRPO for control reasoning
Multi-agent architecture: action generator, symbolic/digital-twin validator, and reprompting agent for iterative correction

[AINews] Codex usage up >10x in 6 months to 7M users, +1M in the past ~day; did Codex overtake Claude Code??

2026-07-14 01:22 UTC

OpenAI's Codex reaches 7M users, adding 1M in a day, with 10x growth in 6 months. Prime Intellect releases verifiers v1 for agent RL. OpenAI transparently fixes GPT-5.6 Sol usage issues. Grok Build security controversy emerges. Open models and quantization progress. Continual learning research resurfaces.

Codex users grew from ~600k to 7M in 6 months, surpassing Claude Code's growth rate.
Prime Intellect's verifiers v1 redesigns agent RL environment stack with taskset, harness, and runtime.

Show HN: PlanWright – A control plane for AI coding agents

2026-07-13 19:59 UTC

PlanWright is a control plane for AI coding agents that inverts planning and acceptance ceremonies to eliminate human bottlenecks, delivering agent-speed throughput with cryptographic audit trails.

Inverts planning: synthesizes chaotic inputs (transcripts, decks, email, Slack) into structured objectives for agent execution.
Inverts acceptance: triages mechanical checks automatically, routing only judgment calls to humans with signed approvals.

Director: Accelerating Distributed MoE Serving via Online Proactive Expert Placement

2026-07-13 04:00 UTC

Director is a new distributed MoE serving system that minimizes end-to-end latency through prediction-driven, online expert placement. It uses a lightweight cascaded predictor or low-bit quantized replica for expert activation patterns, an online migration module with near-zero downtime, and a relaxation-based optimizer that achieves a (1+ε) approximation ratio in polynomial time. Experiments show an 11–55% reduction in latency for popular MoE models.

Prediction-driven online expert placement
Near-zero downtime expert migration

Signed Symmetric Quantization for Few-Bit Integers

2026-07-13 04:00 UTC

This paper introduces signed symmetric quantization for few-bit integers, addressing clipping errors from standard symmetric quantizers while avoiding the runtime penalty of asymmetric quantization. The method places the extra negative value on the dominant outlier tail, achieving better perplexity and accuracy on large language models at no extra inference cost.

Standard symmetric quantizer clips positive outliers due to signed integer alphabet imbalance, causing non-trivial error at low precision.
Signed symmetric quantization retains symmetric runtime benefits without asymmetric overhead by assigning the extra representable value to the dominant-outlier tail.
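One way to read the idea above: a b-bit signed integer has one more negative code than positive codes, so a symmetric quantizer can flip the sign of the tensor when the dominant outlier is positive, letting that outlier land on the extra negative code. The sketch below follows that reading. The scale choice (maximum magnitude divided by 2^(b-1)) and the flip rule are assumptions for illustration and may differ from the paper's formulation.

```python
from typing import List


def quantize_dequantize(values: List[float], bits: int,
                        flip_to_dominant_tail: bool) -> List[float]:
    """Symmetric quantize-then-dequantize with an optional sign flip."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return [0.0 for _ in values]
    scale = max_abs / (2 ** (bits - 1))  # largest magnitude maps to |qmin|
    sign = 1.0
    if flip_to_dominant_tail and max(values) > -min(values):
        sign = -1.0  # positive tail dominates: send it to the negative side
    codes = [min(qmax, max(qmin, round(sign * v / scale))) for v in values]
    return [sign * code * scale for code in codes]


def max_abs_error(original: List[float], restored: List[float]) -> float:
    return max(abs(a - b) for a, b in zip(original, restored))


weights = [-1.0, 0.5, 4.0]  # toy tensor with a positive outlier
standard = quantize_dequantize(weights, bits=3, flip_to_dominant_tail=False)
signed = quantize_dequantize(weights, bits=3, flip_to_dominant_tail=True)
standard_err = max_abs_error(weights, standard)
signed_err = max_abs_error(weights, signed)
```

In this toy case the standard quantizer clips the outlier to the largest positive code, while the flipped version represents it exactly; the only extra state is one sign per tensor or group.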

KV-PRM: Efficient Process Reward Modeling via KV-Cache Transfer for Multi-Agent Test-Time Scaling

2026-07-13 04:00 UTC

KV-PRM is an efficient process reward model that eliminates text re-encoding by directly using the KV cache from LLM generation, reducing scoring cost from O(L²) to O(L). It matches or outperforms text-PRMs on benchmarks with up to 5000x FLOPs reduction, 37x latency reduction, and 34x memory reduction.

Text-based PRMs re-encode entire trajectories, incurring O(L²) scoring cost.
KV-PRM reuses the KV cache from generation and scores with a single verify token, achieving O(L) complexity.
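
The O(L²) versus O(L) gap can be illustrated by counting attention score computations only. This is a rough sketch under simplifying assumptions (causal attention, one layer and one head, feed-forward cost ignored), and it does not attempt to reproduce the reported speedups. The function names are hypothetical.

```python
def reencode_attention_pairs(length):
    """Re-encoding a trajectory with causal attention: token i attends to
    i tokens (including itself), so the total is 1 + 2 + ... + length."""
    return length * (length + 1) // 2

def verify_token_attention_pairs(length):
    """A single appended verify token attends to the cached keys of all
    `length` trajectory tokens plus itself."""
    return length + 1

for length in (1024, 4096, 16384):
    full = reencode_attention_pairs(length)
    cached = verify_token_attention_pairs(length)
    print(length, full, cached, full / cached)
```

The ratio grows linearly with trajectory length, which is why reusing the cache matters most for long multi-agent traces.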

AI Model Co-Design: Hardware-Friendly LLM Design

2026-07-12 19:35 UTC

AI performance depends on three dimensions: accuracy, throughput, and interactivity. This post focuses on throughput and interactivity, examining how model-design choices can optimize both without sacrificing accuracy, aiming to push the Pareto frontier outward.

Three dimensions of AI performance: accuracy, throughput, interactivity.
Deployments must balance all three; high accuracy is wasted if responses are slow.

What happens between entering the prompt and seeing the first word appear

2026-07-12 00:28 UTC

An exploration of the inference process in large language models, covering autoregressive generation, prefill and decode phases, the KV cache, and decoding strategies, explaining the mechanics behind token-by-token output.

Inference in LLMs is autoregressive: tokens are generated one at a time, each step depending on previous outputs.
The process splits into a fast prefill phase (processing the entire prompt in parallel) and a slower decode phase (generating tokens sequentially).
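
The two phases can be sketched with a toy loop. The "model" here is a hypothetical arithmetic stand-in, not a real network, and the list named cache only plays the role of the KV cache: it is filled once from the whole prompt (prefill) and then extended by one entry per generated token (decode).

```python
def toy_next_token(cache):
    """Hypothetical stand-in for a forward pass that reads the cached context."""
    return (sum(cache) + len(cache)) % 50

def generate(prompt_tokens, max_new_tokens):
    # Prefill: the whole prompt is processed in one pass and cached.
    cache = list(prompt_tokens)
    output = []
    # Decode: one token per step, each step depending on everything before it.
    for _ in range(max_new_tokens):
        token = toy_next_token(cache)
        output.append(token)
        cache.append(token)  # only the new token is added to the cache
    return output

print(generate([1, 2, 3], max_new_tokens=3))
```

The decode loop cannot be parallelized across steps because each token is an input to the next step, which is the source of the slower sequential phase.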

Zero-copy TLS ingress with kTLS and splice(2) for sandboxes

2026-07-10 15:46 UTC

Tensorlake rebuilt its sandbox ingress, moving from an L7 reverse proxy to L4 byte forwarding that uses kernel TLS (kTLS) and splice(2) for a zero-copy data path, achieving 2.2x throughput and nearly halving CPU cost. The new architecture decouples the data plane from the control plane, uses kTLS for in-kernel crypto, and derives adaptive timeouts from byte flow. Performance tests show single-connection throughput increasing from 1.12 GB/s to 2.50 GB/s, with proxy CPU per GB dropping from 0.90 to 0.49 CPU-seconds.

Tensorlake replaced its L7 reverse proxy with L4 byte forwarding, eliminating HTTP parsing and userspace buffering.
The new path uses kernel TLS (kTLS) and splice(2) for zero-copy forwarding, with encryption and decryption done in the kernel.

Real-time dental image verification with Amazon SageMaker AI at Henry Schein One

2026-07-10 15:33 UTC

Henry Schein One developed Image Verify, an AI-powered system on Amazon SageMaker AI that evaluates dental X-ray quality in real time, reducing insurance claim denials. The system scaled from concept to over 10,000 locations in months, processing millions of X-rays with sub-2-second latency.

Up to 20% of dental insurance claims are initially denied due to poor image quality.
Image Verify provides real-time quality scores (1-5) at the point of capture, enabling immediate retakes.

Deploying quantized models on Amazon SageMaker AI with Unsloth

2026-07-10 15:26 UTC

Learn four patterns for deploying Unsloth-quantized models on AWS: EC2 for direct access, SageMaker AI for managed serving, and EKS or ECS for containerized inference. Understand Unsloth's dynamic quantization, model formats (GGUF, safetensors), and operational best practices.

Unsloth dynamic quantization reduces model size by up to 86% with minimal accuracy loss by allocating higher precision to sensitive layers.
Four deployment patterns are covered: EC2 for testing, SageMaker AI for managed endpoints, and EKS/ECS for containerized environments.
