2026-05-08 08:20 UTCIn-site rewrite2 min readUpdated: 2026-06-30 13:03 UTC

Redis Creator Builds a Dedicated Inference Engine for DeepSeek V4: ds4.c

Salvatore Sanfilippo (antirez), the creator of Redis, has open-sourced ds4.c, a lightweight inference engine tailored for DeepSeek V4 Flash. It runs efficiently on Apple Silicon Macs using Metal API, achieving up to 27 tokens/s generation on high-end models.

Source量子位Author: henry

Article intelligence

EngineersAdvanced

Key points

Antirez releases ds4.c, a Metal-only inference engine for DeepSeek V4 Flash, optimized for Mac. No other models supported.
Employs asymmetric quantization (2-bit for MoE expert layers, Q8 for others) and disk-based KV caching for speed.
Built-in OpenAI and Anthropic API compatibility enables easy integration with coding agents like Claude Code.
Project sparks debate on model-specific frameworks; antirez advocates for full-stack local inference as a product.

Why it matters

This matters because antirez releases ds4.c, a Metal-only inference engine for DeepSeek V4 Flash, optimized for Mac. No other models supported.

Technical impact

May affect model selection, inference cost, product capability, and evaluation benchmarks.

This panel is AI-generated and reviewed for accuracy.

Salvatore Sanfilippo, widely known as antirez and the creator of Redis, has released a new open-source project called ds4.c — a dedicated inference engine designed exclusively for DeepSeek V4 Flash. The project has already garnered significant attention in the developer community, with users reporting impressive performance on high-end Mac hardware.

DeepSeek V4 Flash, released in late April, is a Mixture-of-Experts (MoE) model with 284 billion total parameters but only 13 billion active per token, supporting a 1-million-token context window. While such models typically require cloud GPU infrastructure, antirez aimed to bring it to local machines, specifically Apple Silicon Macs.

ds4.c is written entirely in C, Objective-C, and Metal — Apple's graphics and compute API. It does not rely on any existing frameworks like llama.cpp or CUDA, focusing solely on Apple Silicon. This narrow focus allows for extreme optimization. According to benchmarks, on a MacBook Pro M3 Max with 128GB RAM and 2-bit quantization, it achieves 58.52 tokens/s for prefill and 26.68 tokens/s for generation. On a Mac Studio M3 Ultra with 512GB, prefill speeds can reach 468 tokens/s for long prompts.

The key technical innovations include asymmetric quantization: only the MoE expert layers are quantized to 2-bit using IQ2_XXS and Q2_K, while shared experts, projection layers, and routing layers remain at Q8 precision. This preserves accuracy where it matters most. Antirez notes that even at 2-bit, the model performs reliably for coding agent tasks.

Another feature is disk-based KV caching. Instead of recomputing the full prefill for each new request, ds4.c caches the KV cache state to disk keyed by a SHA1 hash of the token sequence. Subsequent requests can skip prefill entirely by matching the prefix. This is particularly beneficial for agents like Claude Code that send large initial prompts every session.

To make the engine practical for agent workflows, ds4.c includes native API compatibility layers for both OpenAI and Anthropic protocols, supporting /v1/chat/completions and /v1/messages respectively, along with tool calling. Users can configure agents like Pi or Claude Code to point to the local ds4.c server directly.

The project has ignited discussion about the future of inference frameworks. A popular Hacker News comment suggested that as GPU costs rise, hyper-optimized engines targeting specific hardware-model combinations could become more common. However, the trade-off is that such engines become obsolete when the model changes. Antirez acknowledges this, but emphasizes that the current bet on DeepSeek V4 Flash is a starting point, and the constraints of running locally on high-end personal machines with at least 128GB RAM remain.

What sets ds4.c apart is antirez's philosophy: local inference should be a full-stack product, not a collection of components. He envisions a tightly integrated bundle — an HTTP-enabled inference engine, a custom GGUF quantization tailored for that engine, and a set of verified agent integrations. If this approach succeeds, it could change how local AI deployment works.

Interestingly, antirez states that ds4.c was developed with "heavy assistance" from GPT 5.5, with humans handling ideas, testing, and debugging. He warns that if you are not comfortable with AI-assisted code, this software is not for you. This transparency highlights the growing role of AI in open-source development.