This paper presents a portable, low-power, battery-operated vision-based fall prediction and detection system using human pose estimation on an AMD Kria K26 SOM. The system uses an Intel RealSense D455 camera and a three-stage pipeline (quantized YOLOX, A2J, and CNN) to achieve real-time, privacy-preserving fall detection on the edge. Results show 4.5 FPS throughput with 75.85% classification accuracy.
Privacy-preserving fall detection system implemented on AMD Kria K26 edge device
Three-stage pipeline: YOLOX for human detection, A2J for joint estimation, CNN for fall classification
Arbor is a multi-agent framework introducing structured tree search as a cognition layer for autonomous agents in large stateful action spaces. Validated on full-stack LLM inference optimization, it achieves up to 193% Pareto improvement in throughput-latency over vendor baselines, with a critic agent ensuring stability.
Arbor uses tree search as shared working memory across agents for coordinated optimization.
Achieves up to 193% throughput-latency Pareto improvement on full-stack LLM inference, hardware-agnostic.
In Diffusion Language Models (DLMs), bidirectional attention breaks existing KV caching methods, driving accuracy to near zero. The proposed bidirectional prefix caching (bicache) dynamically identifies safe layer depths at which to reuse shared prefix KVs, improving throughput by 36.3%-98.3% with only 0-1.8% accuracy degradation.
Existing LLM prefix caching fails in DLMs due to bidirectional attention that dynamically alters context and KVs.
Bicache leverages the observation that shared prefix KVs remain stable in shallow layers, with depth determined by the fraction of shared prefix tokens.
Xiaomi's MiMo team, with TileRT, released MiMo-V2.5-Pro-UltraSpeed, a serving mode for the MiMo-V2.5-Pro model. It decodes over 1000 tokens per second on a 1-trillion-parameter model using a single 8-GPU commodity node. The speedup comes from FP4 quantization, DFlash speculative decoding, and the TileRT runtime. API trial runs June 9–23, 2026.
1T-parameter MoE model achieves 1000+ tokens/sec on commodity GPUs
Three coordinated techniques: FP4 quantization, DFlash speculative decoding, TileRT runtime
Diffusion Large Language Models (dLLMs) refine tokens iteratively but commit them irreversibly, causing a "stability lag" where early decisions remain fragile. Post-Training Quantization (PTQ) error easily flips these borderline decisions at the write frontier, locking them in. FAIR-Calib, a two-stage PTQ framework, probes a full-precision teacher for a position prior and performs off-policy layer-wise calibration with a reweighted hidden-state MSE, protecting fragile frontier states without expensive end-to-end rollouts. Theoretically justified as a surrogate for output KL divergence, FAIR-Calib outperforms baselines on LLaDA and Dream (W4A4), reducing frontier flips and post-commit mismatches.
Diffusion LLMs suffer from stability lag where early token decisions are fragile to quantization error
FAIR-Calib introduces a two-stage PTQ framework with frontier-aware instability reweighting
We present Accelerated Fourier SAT (AFSAT), a GPU-accelerated solver for pseudo-Boolean satisfiability based on continuous local search (CLS). AFSAT turns the proof-of-concept FastFourierSAT into a fully engineered solver that supports any heterogeneous mixture of symmetric constraint types and lengths within a single problem instance. Using the JAX compiler, AFSAT leverages pure function composition, automatic vectorisation, automatic differentiation, and just-in-time (JIT) compilation to perform massively parallel CLS across batches of candidate assignments. We demonstrate substantially improved numerical stability, runtime performance, and memory efficiency over the proof-of-concept. We achieve this by identifying and addressing limitations that arise from memory latency and floating-point representation, and by leveraging automatic parallelisation and compact representations. The inherent representational and stability limitations of floating point are partially addressed by a tailored discrete Fourier transform implementation. Throughput scales near-linearly across multiple accelerators via JAX array sharding.
AFSAT is a GPU-accelerated pseudo-Boolean SAT solver using continuous local search, improving upon FastFourierSAT.
It leverages JAX for massive parallelization via function composition, vectorization, differentiation, and JIT compilation.
This article details the configuration and memory calibration required to run the Qwen 3.6 35B MoE model at a 450,000 token context window on a single 32GB VRAM GPU (NVIDIA RTX 5090) using llama.cpp with TurboQuant and YaRN scaling. It covers model selection, quantization trade-offs, KV cache quantization, RoPE scaling, multimodal setup, replication guide, VRAM lifecycle management, and performance evaluation.
Run Qwen3.6-35B-A3B-Q6_K on a single RTX 5090 with 450K context using llama.cpp TurboQuant fork and YaRN scaling.
Achieve 450K context by compressing KV cache to 3-bit (turbo3) and extending RoPE beyond native 262K with YaRN, but at the cost of perplexity and retrieval accuracy.
Roblox acquired Morpheus AI and formed Roblox Labs, then quietly released a world model game called World Research Station which received a 3% rating due to poor performance, high latency, and glitches. The article criticizes Roblox for rushing out immature technology that may harm the field's reputation.
Roblox acquired Morpheus AI and formed Roblox Labs to showcase AI world models.
Quietly released World Research Station, which suffered from terrible performance and earned a 3% rating.
Google DeepMind released Quantization-Aware Training checkpoints for Gemma 4, targeting edge devices and consumer GPUs. This comparison of BF16, Q4_0 QAT, and the new mobile QAT format focuses on memory footprint, quality preservation, and deployment suitability using published data.
Q4_0 QAT reduces E2B memory from 9.6 GB (BF16) to 3.2 GB, and E4B from 15 GB to 5 GB.
The new mobile QAT format brings E2B to ~1 GB; text-only goes under 1 GB.
Google releases new Gemma 4 checkpoints optimized with Quantization-Aware Training (QAT) to reduce memory usage and enable local deployment on edge devices and consumer GPUs. The models include a custom mobile quantization format that cuts memory footprint to 1GB for the E2B model.
QAT integration during training minimizes quality loss from compression.
Custom mobile quantization schema includes static activations, channel-wise quantization, targeted 2-bit quantization, and embedding/KV cache optimization.
NVIDIA introduces Dynamo Snapshot, a checkpoint/restore approach that uses CRIU and cuda-checkpoint to drastically reduce cold-start latency for AI inference workloads on Kubernetes. It cuts startup times from minutes to seconds, with optimizations including KV cache unmapping, parallel memfd restore, Linux native AIO, and GPU Memory Service.
Dynamo Snapshot eliminates cold-start delays by checkpointing and restoring inference worker state on Kubernetes.
Optimizations include KV cache unmapping, parallel memfd restore, Linux native AIO, and GPU Memory Service (GMS).
LANTERN is a lightweight memory layer that proactively archives conversation turns and restores details after compaction via hybrid retrieval, using zero LLM calls and adding <25ms latency per turn. It recovers 78.3% of lost facts, outperforming MemGPT, and improves accuracy of production LLMs by 8.4 percentage points on average.
LANTERN is a zero LLM-call memory layer with <25ms latency per turn, recovering lost details after context compaction.
On 94 real conversations, LANTERN-Rerank recovers 78.3% of verifiable facts, outperforming MemGPT's 72.4%.
NVIDIA has released Nemotron 3 Ultra, a 550B total (55B active) open Mixture-of-Experts hybrid Mamba-Transformer for long-running agents. It pairs a 1M-token context with up to ~6x higher inference throughput than comparable open LLMs at on-par accuracy, and ships with open weights, training data, and recipes under OpenMDW-1.1.
NVIDIA releases Nemotron 3.5 Content Safety, a unified model combining multimodal input, multilingual coverage, custom enterprise policy enforcement, and auditable reasoning for content safety. Built on Google Gemma 3 4B IT and fine-tuned with LoRA, it supports explicit training in 12 languages with zero-shot generalization to ~140 languages. New features include custom policy enforcement via natural language specifications and a THINK mode for auditable step-by-step reasoning. The model achieves ~85% average accuracy across multiple multilingual and multimodal safety benchmarks while maintaining a compact 4B-parameter size and low latency. NVIDIA also releases a safety dataset with multimodal, multilingual safety reasoning traces.
When AI inference costs threatened Mate Security's runway, CEO Asaf Wiener didn't just cut costs—he restructured the company so that every backend engineer owns model selection, evaluation, and routing for their workloads. This shift from cloud-era opacity to workload-level cost visibility has enabled quality-cost optimization, with open-source models sometimes outperforming frontier APIs on specific tasks. Wiener argues that an AI-native company's only structural advantage is shipping against the best model available that day, enabled by an 'execution mode' culture that avoids legal-policy review cycles and hires for adaptability.
Wiener broke down AI inference cost into ~10 sub-lines for workload-level visibility, projecting per-feature cost before shipping.
Every backend engineer at Mate runs evals on their workloads, choosing models based on quality and cost, updated continuously.
NVIDIA introduces Nemotron 3.5 ASR, a 600M-parameter streaming multilingual speech-to-text model supporting 40 language-locales with low latency, high accuracy, and built-in punctuation and capitalization. The article details how to fine-tune the model for specific languages, domains, or accents, showing significant WER reductions for Greek and Bulgarian as examples.
Nemotron 3.5 ASR is a single-checkpoint streaming multilingual model supporting 40 language-locales.
It uses a Cache-Aware FastConformer-RNNT architecture for low latency and high accuracy.
arXiv:2606.04050v1
Existing quantization methods are fundamentally limited by rigid, integer-based bit-widths (e.g., 2-bit, 3-bit), resulting in a "deployment gap" where Large Language Models cannot be optimally fitted to specific memory budgets. To bridge this gap, we introduce LiftQuant, a novel framework that enables continuous bit-width control for true Pareto-optimal deployment. The core innovation is a "lift-then-project" mechanism which approximates low-dimensional weight vectors by projecting a simple 1-bit lattice from a higher-dimensional "lifted" space. Crucially, the effective bit-width is determined simply by the ratio of the lifted dimension to the original dimension, which allows the bit-width to be tuned quasi-continuously, since the dimension is a flexible structural parameter. This projection generates a structured yet non-uniform codebook, capturing the expressive power of Vector Quantization (VQ). Unlike VQ, however, LiftQuant's decoding path relies solely on linear transformations and 1-bit uniform quantizers, so it remains hardware-friendly. This flexibility is transformative: LiftQuant enables a 70B LLM to be compressed to 2.4 bits to precisely fit a 24GB GPU, where its performance significantly surpasses state-of-the-art 2-bit models fitted on the same device. Our code and checkpoints are available at https://github.com/Heliulu/LiftQuant.
Existing quantization methods are limited to integer bit-widths, causing a deployment gap.
LiftQuant introduces a lift-then-project mechanism for continuous bit-width control.
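The bit-width arithmetic in the LiftQuant abstract can be checked with a short sketch. It takes the stated rule (effective bit-width = lifted dimension / original dimension) at face value and counts weight storage only. The dimension pair 12/5 is a hypothetical choice that happens to give 2.4 bits; the paper's actual dimensions are not given here, and projection matrices, scales, and unquantized tensors are ignored.

```python
from fractions import Fraction

def effective_bits(lifted_dim: int, original_dim: int) -> Fraction:
    """Bits per weight when each original_dim-vector is coded by lifted_dim sign bits."""
    return Fraction(lifted_dim, original_dim)

def weight_bytes(num_params: float, bits_per_weight: float) -> float:
    """Weight storage only; ignores projection matrices, scales, and activations."""
    return num_params * bits_per_weight / 8

bits = effective_bits(12, 5)            # hypothetical dimension pair, not from the paper
size_gb = weight_bytes(70e9, float(bits)) / 1e9
print(f"{float(bits):.1f} bits/weight -> {size_gb:.1f} GB for 70B parameters")
```

Under these assumptions the weights take about 21 GB, which leaves roughly 3 GB of a 24 GB card for everything else. Changing the lifted dimension by one step moves the bit-width by 1/original_dim, which is the sense in which the control is quasi-continuous.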
A systematic study of projection sharing in the query-key-value (QKV) attention mechanism of transformers shows that sharing key and value projections (Q-K=V) reduces KV cache by 50% with only 3.1% perplexity degradation. Combining it with grouped-query attention (GQA) or multi-query attention (MQA) achieves 87.5% and 96.9% cache reduction, respectively, enabling practical on-device inference. Experiments span synthetic tasks, vision, and language modeling. Code is publicly available.
Systematically evaluates three projection sharing constraints: Q-K=V, Q=K-V, and Q=K=V.
Q-K=V achieves 50% cache reduction with only 3.1% perplexity increase in language modeling.
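The three cache-reduction figures follow from simple counting, as the sketch below illustrates. The head count (16) and GQA group count (4) are assumptions chosen because they reproduce the reported percentages; the summary does not state the actual configuration.

```python
def kv_cache_reduction(num_heads: int, num_kv_heads: int, share_kv: bool) -> float:
    """Fraction of KV cache saved relative to multi-head attention with separate K and V."""
    tensors_per_head = 1 if share_kv else 2      # K=V stores one tensor instead of two
    baseline = 2 * num_heads                     # separate K and V for every query head
    return 1 - (tensors_per_head * num_kv_heads) / baseline

H = 16  # assumed number of query heads
for name, kv_heads in [("MHA + K=V", 16), ("GQA (4 groups) + K=V", 4), ("MQA + K=V", 1)]:
    print(f"{name}: {kv_cache_reduction(H, kv_heads, share_kv=True):.1%}")
```

With these assumed values the savings are 1/2, 7/8, and 31/32 of the baseline cache: sharing K and V halves the cache, and reducing the number of KV heads multiplies that saving.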
DeepLearning.AI and Red Hat offer a free intermediate course on efficient LLM inference with vLLM, taught by Cedric Clyburn. The course covers quantization, serving with vLLM, and benchmarking, with 9 video lessons, 3 code examples, and a quiz.
Learn to apply quantization to reduce model memory footprint and measure accuracy tradeoffs
Serve models with vLLM using continuous batching, PagedAttention, and prefix caching
Nvidia unveils the Groq 3 LPU, its first chip dedicated to AI inference, featuring an SRAM-based architecture for ultra-low latency. The chip, which incorporates technology licensed from Groq, works alongside Vera Rubin GPUs to optimize performance through inference disaggregation, signaling a shift toward inference-focused computing in the AI industry.
Nvidia announces Groq 3 LPU, its first inference-specific chip, using a linear data flow architecture with on-chip SRAM.
The chip achieves 150 TB/s memory bandwidth, seven times that of the Vera Rubin GPU, enabling low-latency token generation.
AURA-Mem proposes a constant-size recurrent memory for robot policies that writes only when an observation would change the next action, drastically reducing memory writes while maintaining accuracy. It uses a learned gate trained on an action-error signal, achieving a fixed 4,224-byte inference state versus a growing KV cache. Experiments show matching success rates with up to 7x fewer writes.
AURA-Mem replaces KV-cache with a fixed-size (4,224 bytes) recurrent memory for robot policy inference.
Action-gating mechanism determines writes based on whether the observation affects future actions, reducing writes by 5-7x.
This article examines the limitations of the Transformer architecture and introduces liquid models as a promising alternative for low-latency, private on-device intelligence.
Transformer's global attention leads to high memory and compute costs during inference.
Liquid models use dynamics instead of attention, offering efficiency for real-time and edge scenarios.
Long-context decoding in LLMs is constrained by memory bandwidth for fetching KV cache. Existing methods prune keys before decoding, ignoring joint key-value dependence. ART is a lightweight run-time mechanism that tracks accumulated attention outputs and terminates KV block accesses when contributions become negligible. It is orthogonal to existing methods and achieves 20% higher throughput on LongBench with comparable accuracy.
ART is a lightweight run-time mechanism that dynamically terminates unnecessary KV cache accesses by tracking attention outputs.
It is orthogonal to existing key-based KV cache management methods and can be seamlessly integrated.
DAStatFormer is a hybrid multibranch Transformer that uses statistical features from multiple domains to classify DAS events efficiently, achieving 99.4% accuracy with fewer parameters and lower inference cost.
Extracts 24 ANOVA-selected attributes per channel from temporal, waveform, and spectral domains.
Uses step-wise and channel-wise attention branches fused via adaptive gating.
BitsMoE is an efficient quantization framework for Mixture-of-Experts (MoE) large language models. It uses SVD to decompose each MoE layer into a shared basis and expert-specific spectral factors, preserving the shared basis without quantization to maintain cross-expert structure. An integer linear programming formulation minimizes reconstruction loss under a fixed bit budget. Experiments show that BitsMoE significantly reduces accuracy degradation in ultra-low-bit regimes, achieving 12.3× quantization speedup, 27.83 percentage point average accuracy improvement, and 1.76× decoding speedup over GPTQ on Qwen3-30B-A3B-Base at 2 bits.
Proposes BitsMoE, which leverages SVD decomposition of MoE layers for fine-grained quantization.
Uses integer linear programming for activation-aware mixed-precision bit allocation to minimize reconstruction loss.
Together AI optimizes MiniMax M3 serving with KV-block-major sparse attention, paged MSA decode, optimized index scoring, and a Rust-based multimodal gateway, achieving 81–125% throughput improvements across concurrency levels.
MiniMax M3 combines coding, agentic workflows, and multimodal reasoning with a 1M-token context window.
Together AI's kernel team developed KV-block-major sparse attention and integrated MSA with paged attention.
This post explores how combining Amazon FSx for Lustre, NVIDIA GPUDirect Storage, and sharded parallel loading reduces cold-start time-to-first-token for large language models from minutes to seconds, and how TurboQuant KV cache significantly increases context window size.
CPU-based model loading is a cold-start bottleneck, taking 10–20 minutes for a 405B model.
FSx for Lustre with GPUDirect Storage enables direct GPU HBM loading via EFA, bypassing CPU.
QVAC SDK 0.12.0 introduces TurboQuant, a KV-cache quantization algorithm that reduces context memory consumption by up to 5x, enabling full 262K-token contexts on consumer GPUs. It works without model retraining and is based on Google Research's ICLR 2026 paper.
TurboQuant compresses KV cache from 16-bit to ~3-bit with minimal accuracy loss.
Low-VRAM devices like RTX 5060 can now handle full 262K context.
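To see where a figure like "up to 5x" comes from, a rough KV-cache size estimate helps. The sketch below uses the standard count of two tensors (keys and values) per layer, KV head, and token. The model shape is hypothetical and chosen only for illustration; it is not the shape of any model named above, and per-block metadata such as scales is ignored.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bits_per_value):
    """Keys plus values for every layer, KV head, and token."""
    values = 2 * layers * kv_heads * head_dim * tokens
    return values * bits_per_value / 8

# Hypothetical model shape, chosen only for illustration.
shape = dict(tokens=262_144, layers=32, kv_heads=8, head_dim=128)
fp16 = kv_cache_bytes(bits_per_value=16, **shape)
q3 = kv_cache_bytes(bits_per_value=3, **shape)
print(f"16-bit: {fp16 / 2**30:.1f} GiB, 3-bit: {q3 / 2**30:.1f} GiB, ratio {fp16 / q3:.2f}x")
```

The raw 16-to-3-bit ratio is 16/3, a little over 5x; any stored quantization metadata would pull the practical ratio below that.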
SANA-Streaming is a system-algorithm co-designed framework for high-resolution real-time streaming video editing on consumer GPUs. It features a Hybrid Diffusion Transformer, Cycle-Reverse Regularization, and efficient system co-design, achieving 24 FPS at 1280x704 on an RTX 5090. Experiments show significant improvements in temporal coherence and throughput over state-of-the-art.
Hybrid Diffusion Transformer uses softmax attention in select blocks to enhance local modeling while preserving linear layer efficiency.
Cycle-Reverse Regularization enforces temporal consistency via flow matching without requiring paired long videos.
Trajectory, working with UC Berkeley Sky Lab and Anyscale, built a concurrent multi-LoRA training stack for continual learning. It maps each RL experiment to a dedicated LoRA adapter on an always-hot engine, reporting a 2.81× end-to-end experiment-throughput gain over a single-tenant baseline with no reward regression. The code is open-sourced in NovaSky-AI/SkyRL.
Trajectory introduces C-LoRA, a concurrent multi-LoRA training stack achieving 2.81× experiment-throughput gain.
Each experiment uses a dedicated LoRA adapter on a warm engine, leveraging vLLM multi-LoRA inference for concurrency.
This article analyzes the memory bottleneck in AI hardware, particularly during LLM inference. It covers approaches at the chip level (Groq, Cerebras, MatX, d-Matrix), inference engines (RadixArk, Inferact), KV cache infrastructure (TensorMesh/LMCache), and packaging/interconnect (CoWoS). The key insight: the market is a stack of memory problems, and durable companies need to own a control point that cannot be internalized elsewhere in the stack.
Modern GPU tensor throughput far outpaces HBM bandwidth, causing underutilization during decode
Solutions target memory at chip, engine, cache, and packaging levels
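The claim that decode leaves tensor units underutilized can be illustrated with a roofline-style estimate. This is a simplified sketch: it assumes every decode step reads all active weights once, costs about 2 FLOPs per parameter per token, and ignores KV-cache traffic. The accelerator and model numbers are hypothetical.

```python
def decode_tokens_per_second(active_params, bytes_per_param, hbm_bytes_per_s, flops_per_s, batch=1):
    """Roofline-style upper bound on decode throughput for one forward pass per step."""
    memory_time = active_params * bytes_per_param / hbm_bytes_per_s   # weights read once per step
    compute_time = 2 * active_params * batch / flops_per_s            # ~2 FLOPs per param per token
    step_time = max(memory_time, compute_time)
    bound = "memory" if memory_time >= compute_time else "compute"
    return batch / step_time, bound

# Hypothetical accelerator and model: 3 TB/s HBM, 1 PFLOP/s dense, 70B params in 16-bit.
for b in (1, 64, 512):
    tps, bound = decode_tokens_per_second(70e9, 2, 3e12, 1e15, batch=b)
    print(f"batch {b}: {tps:,.0f} tokens/s ({bound}-bound)")
```

At small batch sizes the step time is set entirely by how fast weights stream from HBM, so extra tensor throughput buys nothing; only large batches shift the bound to compute in this model.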
Overline is a Chrome extension that provides real-time AI captions and live translation for any browser video, with sub-second latency and no need for subtitles.
Real-time AI captions and live translation for browser videos
Together AI built the fastest speech-to-text stack on Artificial Analysis by treating ASR as a full-path systems problem, not just a GPU inference problem. This article details optimizations including TensorRT multi-profile encoders, conditional CUDA graphs, shared memory, evented I/O, and gc.freeze() to eliminate tail latency.
Together AI achieved fastest STT by optimizing the entire system path, not just GPU inference.
Key techniques: TensorRT multi-profile encoders, conditional CUDA graphs, zero-copy shared memory, and evented I/O.
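Of the techniques listed, `gc.freeze()` is a standard-library call that is easy to demonstrate. The sketch below shows the general pattern of freezing long-lived startup objects so later garbage collections skip them; how Together AI applies it in their stack is not described here, and `warm_up` is a hypothetical stand-in for real initialization.

```python
import gc

def warm_up():
    """Hypothetical stand-in for loading models, tokenizers, and other long-lived state."""
    return {i: [str(i)] * 4 for i in range(10_000)}

state = warm_up()
gc.collect()          # clear startup garbage first so it is not frozen
gc.freeze()           # move all currently tracked objects to the permanent generation
frozen = gc.get_freeze_count()
print(f"{frozen} objects excluded from future collections")
```

Objects in the permanent generation are ignored by subsequent collections, so a full collection during request handling has less to traverse. `gc.unfreeze()` reverses the operation if needed.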
Azercell Telecom collaborated with the AWS Generative AI Innovation Center to build an Azerbaijani LLM on Amazon SageMaker AI, achieving 23% higher training throughput, 58% lower peak GPU memory, and 2× token efficiency via custom tokenizer, FSDP, and Liger Kernel optimizations.
Azercell developed a production-ready Azerbaijani LLM framework using Amazon SageMaker AI.
Custom tokenizer reduced tokens per word from 3.22 to 1.59, doubling encoding efficiency.
Perplexity AI open-sourced a Rust reimplementation of their Unigram tokenizer, achieving 5x lower latency than Hugging Face's tokenizers crate and reducing CPU utilization by 5-6x in production. The optimizations include double-array trie, bitmap packing, and huge pages.
Perplexity AI rewrote the Unigram tokenizer in Rust, achieving 5x lower p50 latency vs Hugging Face tokenizers crate.
Three optimizations: double-array trie, bitmap and cache-line packing, and huge pages.
This tutorial builds a complete pgvector playground in Google Colab, covering installation, embedding creation, HNSW indexing, semantic search, filtered search, distance metric comparisons, half-precision storage, binary quantization, sparse vector search, hybrid retrieval, and vector aggregation. Everything uses open-source tools and requires no external API keys.
Set up PostgreSQL with pgvector extension in Google Colab from scratch.
Generate embeddings with SentenceTransformers and build HNSW indexes for efficient search.
SCALE-COMM is a self-supervised framework that decouples communication learning from policy optimization, learning compact, stable, and policy-relevant latent messages to improve coordination in multi-agent reinforcement learning. It outperforms existing methods on benchmarks and a realistic warehouse task, offering better stability, sample efficiency, and throughput.
Decouples communication learning from policy optimization to reduce interference.
Uses contrastive learning to enforce consistency across agents and time.
This paper proposes novel techniques for inter-utterance style interpolation and intra-utterance style transition in prompt-based TTS models, addressing limitations of coarse global control. Methods include direction vector interpolation and KV-cache swapping with sliding-window attention masking. Experiments show high success rates in gender conversion and smooth style transitions within utterances.
Inter-utterance interpolation via direction vectors between contrastive style prompts enables smooth transitions.
Intra-utterance transition uses KV-cache swapping and sliding-window masking to overcome attention bias.
This paper presents $E^3$-Agent, an executable and evolving agent for resource management of edge AIGC. It separates a fast-path router from a slow-path LLM meta-controller, learns online from execution feedback, and adapts to unknown time-varying service-time mappings. Evaluation shows 65%-73% latency reduction over static baselines and effective stutter suppression.
Edge generative inference faces unknown per-device performance and non-stationarity.
$E^3$-Agent uses a dual-path architecture: fast router + slow LLM meta-controller.
Soro is a family of Tajik-specialized conversational LLMs built on Gemma 3, using 1.9B token Tajik continual pretraining and 40K instruction tuning examples. It substantially outperforms same-size Gemma 3 on Tajik benchmarks while retaining English performance. FP8/INT4 quantization preserves gains for edge deployment. An education pilot is underway in Tajikistan.
Based on Gemma 3, with 1.9B token Tajik continual pretraining and 40K instruction tuning examples.
Substantially outperforms same-size Gemma 3 on Tajik benchmarks, retains English performance.
At Databricks, we’ve built a unique inference platform that serves every frontier model, from open source to proprietary, powering some of the largest agentic applications. Serving over 120T tokens per month, we tackle challenges of reliability and latency through abstractions like model units for capacity management, cost-aware load balancing and autoscaling that save over 80% GPU costs, and runtime reliability mechanisms including black-box health checks that detect silent failures. Profiling multimodal bottlenecks unlocked 3x throughput gains.
Databricks' inference platform serves frontier models including open source and proprietary, handling 120T tokens/month.
Model units provide a VM-like abstraction for capacity management, enabling cost-aware routing and scaling.
The EAGLE team, vLLM team, and TorchSpec team have jointly released EAGLE 3.1 to fix speculative decoding instability in production LLM serving. The algorithm addresses attention drift through two architectural improvements: FC normalization and post-norm hidden-state feedback. Benchmarks show up to 2× longer acceptance length in long-context tasks and 2.03× per-user throughput on Kimi K2.6 at concurrency 1. EAGLE 3.1 is backward-compatible with EAGLE 3 checkpoints and has been merged into vLLM main, shipping in v0.22.0.
EAGLE 3.1 fixes attention drift, where the draft model gradually shifts focus from context tokens to its own generated tokens during deep speculation.
Two architectural fixes: FC normalization to stabilize hidden states, and feeding normalized states back to the next step.
This tutorial demonstrates how to use zeroentropy/zerank-2-reranker, a 4B Qwen3-based cross-encoder reranker, to enhance retrieval quality. It covers environment setup, pairwise scoring, model.rank usage, a two-stage retrieve-and-rerank pipeline, NDCG@10 evaluation, cross-domain testing in finance, legal, and code, and batched throughput measurement.
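For readers unfamiliar with the metric, NDCG@10 can be computed in a few lines. This is a minimal sketch using the linear-gain variant (rel / log2(rank + 1)); the tutorial may use a different variant or a library implementation. The relevance labels are hypothetical, and the ideal ranking is built from the listed documents only.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain with linear gain: rel_i / log2(i + 1), ranks from 1."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(relevances, k=10):
    """DCG of the given ranking divided by DCG of the ideal (sorted) ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

retrieved = [0, 1, 0, 2, 0]      # hypothetical graded labels in first-stage order
reranked = [2, 1, 0, 0, 0]       # same documents after reranking
print(round(ndcg_at_k(retrieved), 3), round(ndcg_at_k(reranked), 3))
```

A reranker improves NDCG when it moves highly relevant documents toward the top, because the logarithmic discount penalizes relevant results that appear late.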
RED is a real-time scheduling framework for multi-task deep neural network workloads on resource-constrained robotic platforms. It adapts to runtime environmental changes by assigning intermediate sub-deadlines, leveraging MIMONet weight sharing, and reconstructing computation graphs. Implemented on NVIDIA Jetson and Apple M-series platforms, RED consistently outperforms existing methods in throughput, deadline satisfaction, robustness, adaptability, and overhead.
RED assigns intermediate sub-deadlines to accommodate evolving computation graphs and asynchronous inference.
It leverages MIMONet's shared parameters to improve schedulability through workload refinement and graph reconstruction.
ActQuant is an action-guided mixed-precision post-training quantization framework for Vision-Language-Action (VLA) models, enabling sub-4-bit weight quantization through a two-stage approach that maintains high success rates on the LIBERO benchmark and a real UR3 robotic arm, significantly reducing memory footprint.
ActQuant employs action-aware mixed-precision quantization to preserve VLA model performance under sub-4-bit weight quantization.
The two-stage framework includes an inter-tensor bit allocator and an intra-tensor scale optimizer focusing on action-critical weights.
AERIC is a lightweight safety monitor that reads hidden states during decoding to anticipate implicit harmful content without extra forward passes. With only 387 trainable parameters, it outperforms larger models on multiple benchmarks and increases latency by just 2.34%.
AERIC predicts harmful content early by analyzing the model's internal hidden states.
Combines short-horizon hazard forecasting, support-sensitive suppression, and prompt-conditioned residual scoring.
This paper analyzes the fundamental tradeoffs among latency, reliability, and cost in LLM-enabled agentic workflows. It introduces performance models using a parametric exponential reliability function for LLM agents and proposes a water-filling token allocation policy under latency and cost constraints.
LLM agentic workflows involve tradeoffs among latency, reliability, and cost.
A parametric exponential reliability function models LLM agent performance.
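A water-filling allocation can be sketched under stated assumptions. Here each agent's reliability is assumed to be 1 - exp(-a * t) for t tokens, the objective is the sum of reliabilities, and a single token budget stands in for the cost constraint. The paper's exact functional form, objective, and latency constraint may differ; the rates below are hypothetical.

```python
import math

def water_fill(rates, budget, iters=100):
    """Maximize sum(1 - exp(-a*t)) subject to sum(t) == budget, t >= 0."""
    def alloc(lam):
        # Stationarity: a * exp(-a*t) = lam  ->  t = ln(a / lam) / a, clipped at zero.
        return [max(0.0, math.log(a / lam) / a) for a in rates]
    lo, hi = 1e-12, max(rates)          # total allocation shrinks as the water level lam rises
    for _ in range(iters):
        mid = math.sqrt(lo * hi)        # geometric bisection, since lam spans many magnitudes
        if sum(alloc(mid)) > budget:
            lo = mid
        else:
            hi = mid
    return alloc(hi)

rates = [0.002, 0.001, 0.0002]          # hypothetical per-agent reliability rates (per token)
tokens = water_fill(rates, budget=3000)
print([round(t) for t in tokens])
```

The multiplier `lam` plays the role of the water level: agents whose marginal reliability gain at zero tokens is below it receive nothing, and the rest receive tokens until their marginal gains are equal.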
Together AI has released OSCAR (Offline Spectral Covariance-Aware Rotation), an INT2 KV cache quantization method for long-context LLM serving. Unlike prior rotation-based approaches that apply data-oblivious Hadamard transforms, OSCAR derives separate rotations for keys and values from attention-aware covariance structures estimated offline. At 2.28 bits per KV element, OSCAR reduces the BF16 accuracy gap to 3.78 points on Qwen3-4B-Thinking-2507 and 1.42 points on Qwen3-8B, while delivering approximately 8× KV memory reduction and up to 3× decode speedup at 100K context length.
OSCAR is a 2-bit KV cache quantization method using attention-aware rotations that maintain near-BF16 accuracy.
It derives rotations from query and value covariances via offline calibration, directing quantization noise to attention-insensitive directions.
One month after DeepSeek V4's release, the open-source community unveiled Reasonix, a tool specifically designed to minimize API costs by maximizing cache efficiency. It achieves a staggering 99.82% cache hit rate, reducing a $61 bill for 400M+ tokens to just $12.
Reasonix is a dedicated coding harness for DeepSeek, focusing on cost reduction.
Its cache-first loop, tool-call repair, and automatic context compression maintain over 90% cache hit rate in long sessions.