Tether brings TurboQuant to QVAC SDK, its local AI engine

QVAC SDK 0.12.0 introduces TurboQuant, a KV-cache quantization algorithm that reduces context memory consumption by up to 5x, enabling full 262K-token contexts on consumer GPUs. It works without model retraining and is based on Google Research's ICLR 2026 paper.

Source: Hacker News AI. Author: qvac

If you have ever pasted a long document into a local AI app and watched the model stop mid-page with “context limit exceeded”, you have hit the memory ceiling that has shaped local AI for years. The model wasn’t the bottleneck. The memory it uses to track context, the Key-Value (KV) cache, was.

QVAC SDK 0.12.0 changes that.

What is the KV cache?

The KV cache is the working memory an LLM keeps during a conversation. Every token of your prompt, every previous assistant turn, every attached document is stored as Key-Value pairs on-device. This cache lets the model maintain coherence across long contexts without reprocessing everything from scratch on each token.

The trade-off: the cache grows linearly with context length and model depth. A Qwen3.5-4B at 262K tokens stores roughly 8 GB of KV data in 16-bit precision. That is twice the size of the Q8 weights themselves. The KV cache, not the model, is what blows past your VRAM.
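The linear growth can be sketched with the standard size formula for a transformer KV cache: one key vector and one value vector per layer, per token. The field names and the example shape below are illustrative assumptions, not the published configuration of Qwen3.5-4B, so the numbers only show how the cache scales.

```typescript
// Dimensions that determine KV-cache size. Field names are illustrative.
interface KvCacheShape {
  layers: number;       // transformer layers that keep a KV cache
  kvHeads: number;      // key/value heads per layer
  headDim: number;      // values per head
  bitsPerValue: number; // 16 for a 16-bit cache
}

// Bytes added per token: one key and one value vector in every layer.
function kvBytesPerToken(s: KvCacheShape): number {
  return (2 * s.layers * s.kvHeads * s.headDim * s.bitsPerValue) / 8;
}

// Total cache size grows linearly with the number of tokens held in context.
function kvCacheBytes(s: KvCacheShape, tokens: number): number {
  return kvBytesPerToken(s) * tokens;
}

// Illustrative shape only, NOT the actual configuration of Qwen3.5-4B.
const exampleShape: KvCacheShape = {
  layers: 32,
  kvHeads: 8,
  headDim: 128,
  bitsPerValue: 16,
};

const perToken = kvBytesPerToken(exampleShape);  // 131072 bytes (128 KiB)
const at8k = kvCacheBytes(exampleShape, 8192);   // 1073741824 bytes (1 GiB)
const at16k = kvCacheBytes(exampleShape, 16384); // twice the 8K figure
```

Doubling either the context length or the layer count doubles the result, which is why long contexts exhaust VRAM well before the weights do.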

Local AI has two memory walls. First, the model weights have to fit on your device: too big and you can’t run it at all. Once they fit, the KV cache becomes the second wall: it caps how much context you can hold. TurboQuant attacks the second wall.

What changes for your app in SDK 0.12.0

TurboQuant compresses the KV cache from 16 bits down to roughly 3 bits per value while preserving accuracy across long-context benchmarks. The practical effect:

| GPU | VRAM | KV budget (VRAM − 4.3 GB) | Context before 0.12.0 | With TurboQuant |
| --- | --- | --- | --- | --- |
| RTX 5060 | 8 GB | 3.7 GB | ~120K tokens | 262K tokens (full) |
| RTX 5070 | 12 GB | 7.7 GB | ~250K tokens | 262K tokens (full) |
| RTX 5090 | 32 GB | 27.7 GB | ~262K tokens (already full) | 262K tokens |
| AMD Ryzen AI Max+ 395 / Strix Halo | 128 GB | 123.7 GB | ~262K tokens (already full) | 262K tokens |

Estimates assume a 4B model at Q8 quantization. Real ceilings depend on the model size and other memory consumers on the device.

Note: These figures do not account for the computation buffer (temporary tensors allocated during inference), so they are approximate estimates.

The table above shows how every class of hardware benefits from TurboQuant:

Devices with low VRAM can now reach a larger maximum context size

Devices with high VRAM already fit the full context, so the smaller KV cache reduces their total memory use
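The table's estimates can be roughly reproduced from three figures quoted in this article: 4.3 GB of Q8 weights, about 8 GB of 16-bit KV data at full context, and roughly 3 bits per value with TurboQuant. The sketch below is a back-of-the-envelope model, not the SDK's own calculation. It assumes "262K" means 262,144 tokens, treats 3 bits as exact, and ignores the computation buffer, as the table does.

```typescript
// Figures quoted in the article; everything else is an illustrative assumption.
const MAX_CONTEXT_TOKENS = 262_144; // assumed exact value behind "262K"
const WEIGHTS_GB = 4.3;             // 4B model at Q8, from the table header
const FP16_KV_GB_AT_MAX = 8;        // ~8 GB of 16-bit KV data at full context

// KV cache size at full context for a given precision.
function kvGbAtFullContext(bitsPerValue: number): number {
  return FP16_KV_GB_AT_MAX * (bitsPerValue / 16);
}

// Largest context that fits in the VRAM left after loading the weights.
function maxContextTokens(vramGb: number, bitsPerValue: number): number {
  const budgetGb = Math.max(0, vramGb - WEIGHTS_GB);
  const gbPerToken = kvGbAtFullContext(bitsPerValue) / MAX_CONTEXT_TOKENS;
  return Math.min(MAX_CONTEXT_TOKENS, Math.floor(budgetGb / gbPerToken));
}

const rtx5060Before = maxContextTokens(8, 16);  // about 121K tokens
const rtx5060After = maxContextTokens(8, 3);    // 262144, capped at the model limit
const rtx5070Before = maxContextTokens(12, 16); // about 252K tokens

// On high-VRAM devices the context is already full, so the gain is freed memory.
const savedGbAtFull = kvGbAtFullContext(16) - kvGbAtFullContext(3); // 6.5
```

Under these assumptions the full-context cache shrinks from about 8 GB to about 1.5 GB, which fits inside the 3.7 GB budget of an 8 GB card.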

What this unlocks in practice:

Local coding assistant with full codebase in context

Long-document analysis (legal contracts, research papers, codebases)

Local 4B+ model with 200K+ context on a single consumer-grade GPU

On-prem enterprise inference for HIPAA / GDPR workloads on a dedicated AI server

How to use TurboQuant in your app

Update to SDK 0.12.0:

npm install @qvac/sdk@latest

To enable TurboQuant on any model you load, pass the turboquant flag in your parameters. That is it.

TurboQuant currently supports only AMD and NVIDIA GPUs. Support for iOS, Android, and Apple Silicon is coming next.
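An app that ships to several platforms may want to set the flag only where it is supported. The sketch below shows that gating logic in isolation. It does not call the QVAC SDK, and every name in it (`GpuVendor`, `LoadParams`, `contextSize`, `buildLoadParams`) is hypothetical. The `turboquant` property name simply follows the wording above, so check the SDK reference for the actual parameter name and shape.

```typescript
// Hypothetical shapes for illustration; not the QVAC SDK's actual types.
type GpuVendor = "nvidia" | "amd" | "apple" | "other";

interface LoadParams {
  contextSize: number;  // hypothetical parameter name
  turboquant?: boolean; // name follows the article's wording; confirm in the SDK reference
}

// Per the article, TurboQuant currently targets AMD and NVIDIA GPUs only.
function supportsTurboQuant(vendor: GpuVendor): boolean {
  return vendor === "nvidia" || vendor === "amd";
}

// Opt in where supported; otherwise omit the flag and keep the default KV cache.
function buildLoadParams(vendor: GpuVendor, contextSize: number): LoadParams {
  return supportsTurboQuant(vendor)
    ? { contextSize, turboquant: true }
    : { contextSize };
}

const onNvidia = buildLoadParams("nvidia", 262_144); // flag set
const onApple = buildLoadParams("apple", 32_768);    // flag omitted
```

Omitting the flag on unsupported platforms, rather than setting it to false, keeps the parameters identical to what the app passed before 0.12.0.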

Why this matters

The context ceiling has, in practice, been an access ceiling. If you could afford a cloud API, you had no KV cache problem. Server farms have effectively unlimited memory. Long context was a feature you bought.

If you wanted to run AI on a device you actually own, where your data stays local, you hit the wall.

TurboQuant narrows that gap. The same model files you already use gain up to five times more context memory headroom on the device you already own. More devices become capable of running real workloads. More people get direct access to intelligence that lives on their own hardware, not in a data center they will never see.

Frequently Asked Questions

What is TurboQuant?

TurboQuant is a KV-cache quantization algorithm published by Google Research at ICLR 2026 (Zandieh et al.). It reduces the running context memory of an LLM by up to 5x with no measurable accuracy loss across major long-context benchmarks.

Does TurboQuant reduce model accuracy?

No. The QVAC team validated TurboQuant across five long-context benchmarks (LongBench, ZeroSCROLLS, RULER, L-Eval, NIAH) with Llama, Qwen, and Mistral models. Almost no accuracy loss was reported across all five. See the QVAC TurboQuant benchmarks listed under Sources for details.

Do I need to retrain my model to use TurboQuant?

No. TurboQuant is data-oblivious. It works with any standard transformer loaded as GGUF in the QVAC SDK without retraining, calibration, or fine-tuning.

Is TurboQuant automatic in SDK 0.12.0 or do I have to opt in?

Opt-in. Pass the TurboQuant flag when you load the model. Without it, the default KV cache behavior is used.

Does TurboQuant compress my model file?

No. It only compresses the KV cache during inference. Your GGUF file size is unchanged. The compression happens in memory at runtime.

Get started

Update the QVAC SDK:

npm install @qvac/sdk@latest

Want the technical breakdown of how TurboQuant works at the algorithm level? Read here

Sources

QVAC TurboQuant benchmarks: https://github.com/tetherto/qvac-fabric-llm.cpp/blob/master/docs/turboquant-benchmarks.md

TurboQuant paper (Zandieh et al., ICLR 2026): Google Research blog

QVAC SDK release notes: https://docs.qvac.tether.io/reference/release-notes/