2026-05-03 · On-site rewrite

The Biggest Regret of DeepSeek V4

DeepSeek V4's technical report introduced many innovations but notably lacked Engram, a conditional memory module jointly open-sourced by DeepSeek and Peking University in January 2026. Engram acts as a native lookup table for Transformers, separating static knowledge retrieval from deep reasoning, which improves efficiency and reasoning performance. Although absent from V4, three subsequent papers explored Engram's potential in CXL memory pooling, collision-free hot-tier optimization, and vision tasks.

Article intelligence

Engineers · Advanced

Key points

DeepSeek V4 omitted Engram, a highly anticipated conditional memory module.
Engram uses hash-based lookup for static knowledge, freeing up network capacity for advanced reasoning.
Three follow-up works applied Engram to CXL memory pooling, a collision-free hot tier, and visual domains.

Why it matters

DeepSeek V4 shipped without Engram, so the efficiency and reasoning gains reported for conditional memory are not part of V4.

Technical impact

May affect model selection, inference cost, product capability, and evaluation benchmarks.

DeepSeek V4's technical report is packed with innovations: mHC, CSA, HCA, Muon, FP4. Yet one glaring omission stood out to the community: Engram. This conditional memory module, co-released by DeepSeek and Peking University in January 2026, was widely expected to be the architectural foundation of V4. Its absence left many feeling that V4 was incomplete.

So what exactly is Engram? At its core, Engram is a native knowledge lookup table for Transformers. The key insight is that language modeling involves two fundamentally different tasks: compositional reasoning, which requires deep dynamic computation, and static knowledge retrieval. Standard Transformers conflate the two, spending layers on reconstructing facts that could simply be looked up. For example, to recognize "Diana, Princess of Wales," the model had to traverse six layers, gradually piecing together features, which is a costly process.

Engram bypasses this by inserting lookup modules between layers 2 and 15 of the Transformer. At each position, a hash lookup maps the current token and its preceding N-gram context to a slot in an embedding table and retrieves the corresponding vector directly. A gating mechanism filters out irrelevant matches. Engram thus adds a sparsity axis separate from MoE: MoE sparsifies computation; Engram sparsifies storage.
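The lookup-and-gate step described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the Engram implementation: the table size, embedding width, N-gram orders, the CRC32 hash, and the dot-product sigmoid gate are all hypothetical choices made for the example, and the function names (`ngram_slot`, `engram_lookup`, `gated_update`) are invented here.

```python
import math
import random
import zlib

DIM = 8            # embedding width (hypothetical)
TABLE_SIZE = 1024  # slots per memory table (hypothetical)
MAX_N = 3          # longest N-gram order looked up (hypothetical)

rng = random.Random(0)
# One table per N-gram order; random rows stand in for learned embeddings.
TABLES = {
    n: [[rng.gauss(0.0, 0.02) for _ in range(DIM)] for _ in range(TABLE_SIZE)]
    for n in range(2, MAX_N + 1)
}


def ngram_slot(tokens, n):
    """Deterministically map an N-gram of token ids to a table slot."""
    key = ",".join(str(t) for t in tokens).encode("utf-8")
    return zlib.crc32(key + bytes([n])) % TABLE_SIZE


def engram_lookup(token_ids, pos):
    """Sum the table rows of every N-gram that ends at position `pos`."""
    out = [0.0] * DIM
    for n in range(2, MAX_N + 1):
        if pos + 1 < n:
            continue  # not enough left context for this order
        suffix = token_ids[pos + 1 - n : pos + 1]
        row = TABLES[n][ngram_slot(suffix, n)]
        out = [a + b for a, b in zip(out, row)]
    return out


def gated_update(hidden, memory):
    """Scale the retrieved vector by a sigmoid gate computed from the hidden state."""
    score = sum(h * m for h, m in zip(hidden, memory)) / math.sqrt(DIM)
    gate = 1.0 / (1.0 + math.exp(-score))
    return [h + gate * m for h, m in zip(hidden, memory)], gate


# Example: retrieve memory for the last token and mix it into a hidden state.
tokens = [17, 942, 5, 3301]
memory = engram_lookup(tokens, pos=3)
new_hidden, gate = gated_update([0.1] * DIM, memory)
```

The point of the sketch is that the slot depends only on token ids, never on activations, so retrieval costs one hash and one row read per N-gram order. The gate is the only part that looks at the hidden state, which is where an irrelevant or colliding match can be suppressed.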

The paper's core experiment fixed total parameters and per-token activation, then allocated budget between MoE experts and Engram memory. The result was a U-shaped curve: pure MoE was not optimal; allocating 20-25% of sparse parameters to Engram minimized loss. Scaling Engram to 27B parameters with 3.8B activated, trained on 262B tokens, validated the approach.

On knowledge-heavy tasks, gains were as expected: MMLU +3.4, CMMLU +4.0. But improvements on reasoning, code, and math exceeded expectations: BBH +5.0, ARC-Challenge +3.7, HumanEval +3.0, MATH +2.4. Long-context performance jumped dramatically: Multi-Query NIAH rose from 84.2% to 97.0%.

Why does a memory module boost reasoning? Analysis via LogitLens and CKA showed that representations in Engram-27B's 5th layer resembled those in the MoE baseline's 12th layer. Engram freed early layers from reconstructing static knowledge, effectively deepening the network.
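For readers unfamiliar with CKA (centered kernel alignment), the layer-to-layer comparison can be illustrated with the common linear variant. This is a sketch under an assumption: the article does not say which CKA variant the paper used, so linear CKA is swapped in here, and the small matrices are made-up stand-ins for per-example layer activations.

```python
import math


def center_columns(matrix):
    """Subtract each column's mean so every feature is zero-centred."""
    n = len(matrix)
    means = [sum(col) / n for col in zip(*matrix)]
    return [[v - m for v, m in zip(row, means)] for row in matrix]


def cross_gram_sq_norm(a, b):
    """Squared Frobenius norm of a^T b for matrices with the same number of rows."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            dot = sum(x * y for x, y in zip(col_a, col_b))
            total += dot * dot
    return total


def linear_cka(x, y):
    """Linear CKA between two activation matrices shaped (examples, features)."""
    xc, yc = center_columns(x), center_columns(y)
    numerator = cross_gram_sq_norm(xc, yc)
    denominator = math.sqrt(cross_gram_sq_norm(xc, xc) * cross_gram_sq_norm(yc, yc))
    return numerator / denominator


# Toy activations: four examples, two features per layer (illustrative values only).
layer_a = [[1.0, 2.0], [3.0, 1.0], [0.0, 4.0], [2.0, 2.0]]
layer_b = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
similarity = linear_cka(layer_a, layer_b)
```

A score near 1 means the two layers encode the same examples with nearly the same geometry, up to rotation and uniform scaling. A claim such as "layer 5 resembles the baseline's layer 12" corresponds to a high score for that specific pair of layers when the same inputs are fed to both models.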

Engineering-wise, a 100-billion-parameter Engram table sits in host DRAM, with only 2.8% throughput loss on an H800 for an 8B-dense model. This is thanks to deterministic indexing, enabling CPU prefetching to overlap with GPU computation.
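The overlap works because slot indices are a function of token ids alone, so the rows a batch will need are known before its forward pass begins. The following is a toy simulation of that idea, assuming a worker thread as a stand-in for asynchronous host-side prefetch; `HOST_TABLE`, `fetch_rows`, `compute_step`, and the sleep-based timings are hypothetical and only illustrate the scheduling pattern.

```python
import time
import zlib
from concurrent.futures import ThreadPoolExecutor

TABLE_SIZE = 4096  # hypothetical slot count
DIM = 4            # hypothetical embedding width

# Stand-in for the large embedding table held in host DRAM.
HOST_TABLE = [[float(i)] * DIM for i in range(TABLE_SIZE)]


def slots_for(token_ids, n=2):
    """Slots depend only on token ids, so they are known before compute starts."""
    slots = []
    for pos in range(n - 1, len(token_ids)):
        key = ",".join(map(str, token_ids[pos + 1 - n : pos + 1])).encode("utf-8")
        slots.append(zlib.crc32(key) % TABLE_SIZE)
    return slots


def fetch_rows(slots):
    """Simulated host-side gather; a real system would follow it with a device copy."""
    time.sleep(0.01)
    return [HOST_TABLE[s] for s in slots]


def compute_step(batch, rows):
    """Simulated accelerator work that consumes the rows fetched for this batch."""
    time.sleep(0.01)
    return sum(r[0] for r in rows)


def run_pipeline(batches):
    """Fetch rows for batch i+1 on a worker thread while batch i is computed."""
    if not batches:
        return []
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch_rows, slots_for(batches[0]))
        for i, batch in enumerate(batches):
            rows = pending.result()  # rows for the current batch
            if i + 1 < len(batches):
                pending = pool.submit(fetch_rows, slots_for(batches[i + 1]))
            results.append(compute_step(batch, rows))
    return results


outputs = run_pipeline([[1, 2, 3], [4, 5, 6, 7], [8, 9]])
```

With a data-dependent index (for example, one computed from hidden states), the fetch for a batch could not be issued until part of its forward pass had run, and the memory latency would sit on the critical path instead of hiding behind compute.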

Although Engram didn't make it into V4, three follow-up papers emerged within months:

**CXL Memory Pooling** (March 10): A collaboration between Peking University, Alibaba Cloud, and others proposed moving Engram into a shared CXL memory pool. Eight servers share a 4TB pool, achieving end-to-end throughput loss under 5%. The deterministic addressing of Engram makes it an ideal fit for CXL.

**Collision-Free Hot-Tier Experiment** (January 23): Researcher Tao Lin tested whether using minimal perfect hashing to eliminate hash collisions for high-frequency N-grams would improve performance. Surprisingly, the collision-free design did not yield stable loss improvements under iso-parameter settings. The counterintuitive result suggests that seemingly obvious optimizations, such as removing collisions, do not automatically translate into lower loss.
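The two-tier indexing idea behind this experiment can be sketched as follows. This is an illustration under assumptions, not the experiment's code: a Python dict is swapped in for a true minimal perfect hash function (both map each hot key to a unique slot, but a real MPH would not store the keys), and the counts, tier sizes, and function names are hypothetical.

```python
import zlib
from collections import Counter


def build_hot_index(ngram_counts, hot_size):
    """Give each of the `hot_size` most frequent N-grams its own dedicated slot."""
    hot = [ngram for ngram, _ in ngram_counts.most_common(hot_size)]
    return {ngram: slot for slot, ngram in enumerate(hot)}


def slot_for(ngram, hot_index, cold_size):
    """Hot N-grams get a collision-free slot; the rest share hashed cold slots."""
    if ngram in hot_index:
        return hot_index[ngram]
    key = ",".join(map(str, ngram)).encode("utf-8")
    # Cold slots are placed after the hot region and may collide with each other.
    return len(hot_index) + zlib.crc32(key) % cold_size


# Hypothetical N-gram frequencies: the two most frequent become the hot tier.
counts = Counter({(1, 2): 50, (2, 3): 30, (3, 4): 5, (4, 5): 1})
hot_index = build_hot_index(counts, hot_size=2)
hot_slot = slot_for((1, 2), hot_index, cold_size=8)
cold_slot = slot_for((3, 4), hot_index, cold_size=8)
```

In an iso-parameter comparison the total row count (hot plus cold) is held fixed, so every dedicated hot slot is one fewer slot for the cold tier to share.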

**Visual Tiny Engram**: The AutoArk team extended Engram to vision tasks. After reproducing the text version based on Qwen-3, they applied Engram to Stable Diffusion. Compared to LoRA, Engram achieved equivalent results with only 15-30% of the parameters, and without the concept degradation seen in LoRA when injecting multiple new concepts.

These developments show that while V4 shipped without Engram, its ideas continue to shape follow-up work. The original Engram repository on GitHub hasn't been updated since January 14, but the community is actively exploring its potential. As the Engram paper concluded: "We believe conditional memory will be an indispensable modeling primitive for the next generation of sparse models." Perhaps that generation will be V5, or maybe V4.1.