2025-05-15 17:58 UTC | Updated: 2026-06-27 00:25 UTC

DeepSeek-V3 New Paper is coming! Unveiling the Secrets of Low-Cost Large Model Training through Hardware-Aware Co-design

A new 14-page technical paper from the DeepSeek-V3 team, co-authored by CEO Wenfeng Liang, explores hardware-aware model co-design as a way to overcome scaling challenges. It details innovations such as Multi-head Latent Attention (MLA), DeepSeekMoE, FP8 training, and node-aware routing that enable cost-efficient large-scale training and inference.

Source: Synced Review | Author: Synced

A newly released 14-page technical paper from the team behind DeepSeek-V3, with DeepSeek CEO Wenfeng Liang as a co-author, is titled “Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures.” This follow-up to the team's initial technical report examines the relationship between large language model (LLM) development, training, and the underlying hardware infrastructure. The paper moves beyond the architectural specifics of DeepSeek-V3 to explore how hardware-aware model co-design can address the limitations of current hardware and enable cost-efficient large-scale training and inference.

https://arxiv.org/pdf/2505.09343

The rapid scaling of LLMs has exposed critical bottlenecks in current hardware architectures, particularly concerning memory capacity, computational efficiency, and interconnect bandwidth. DeepSeek-V3, trained on a cluster of 2048 NVIDIA H800 GPUs, serves as a compelling case study demonstrating how a synergistic approach between model design and hardware considerations can overcome these limitations. This research focuses on the interplay between hardware architecture and model design in achieving economical large-scale training and inference, aiming to provide actionable insights for efficiently scaling LLMs without compromising performance or accessibility.

Key areas of focus in the paper include:

Hardware-Driven Model Design: Analyzing how hardware characteristics, such as FP8 low-precision computation and scale-up/scale-out network properties, influence architectural choices within DeepSeek-V3.

Hardware-Model Interdependencies: Investigating how hardware capabilities shape model innovation and how the evolving demands of LLMs drive requirements for next-generation hardware.

Future Directions for Hardware Development: Drawing practical insights from DeepSeek-V3 to guide the co-design of future hardware and model architectures for scalable and cost-effective AI systems.

DeepSeek-V3’s Design Principles: Addressing Core Scaling Challenges

DeepSeek-V3 incorporates several key architectural innovations, as illustrated in Figure 1 of the paper, including the DeepSeekMoE architecture and Multi-head Latent Attention (MLA). These designs directly tackle the core challenges of scaling LLMs: memory efficiency, cost-effectiveness, and inference speed.

Memory Efficiency: MLA and KV Cache Optimization

LLM memory demands are growing much faster than the capacity of high-speed memory such as HBM: the paper cites growth of more than 1000% per year in memory demand, against typically less than 50% per year for HBM capacity. While multi-node parallelism offers a workaround, optimizing memory usage at the source remains crucial. DeepSeek addresses this bottleneck with Multi-head Latent Attention (MLA), which uses projection matrices to compress the key-value (KV) representations of all attention heads into a smaller latent vector, trained jointly with the model. During inference, only this compressed latent vector needs to be cached, which significantly reduces memory consumption compared to storing the full KV cache for each head.
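The caching idea can be illustrated with a toy sketch. This is not DeepSeek's implementation: the dimensions, the random weights, and the helper names (`rand_matrix`, `matvec`, `append_token`) are all hypothetical, and the separate positional (RoPE) key component that MLA also caches is left out for brevity.

```python
import random

random.seed(0)

# Toy sizes chosen for illustration only; they are NOT DeepSeek-V3's dimensions.
D_MODEL, N_HEADS, D_HEAD, D_LATENT = 64, 4, 16, 8

def rand_matrix(rows, cols):
    """Random (rows x cols) matrix standing in for trained projection weights."""
    return [[random.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def matvec(vec, mat):
    """Row vector (len = rows) times matrix (rows x cols) -> vector (len = cols)."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]

W_down = rand_matrix(D_MODEL, D_LATENT)            # hidden state -> shared latent
W_up_k = rand_matrix(D_LATENT, N_HEADS * D_HEAD)   # latent -> keys for all heads
W_up_v = rand_matrix(D_LATENT, N_HEADS * D_HEAD)   # latent -> values for all heads

kv_cache = []  # stores one small latent vector per token, not full K and V

def append_token(hidden):
    """Compress the token's hidden state and cache only the latent."""
    kv_cache.append(matvec(hidden, W_down))

def keys_and_values():
    """Rebuild per-head keys and values from the cached latents when needed."""
    keys = [matvec(c, W_up_k) for c in kv_cache]
    values = [matvec(c, W_up_v) for c in kv_cache]
    return keys, values

for _ in range(3):
    append_token([random.gauss(0.0, 1.0) for _ in range(D_MODEL)])

keys, values = keys_and_values()
cached_per_token = len(kv_cache[0])      # numbers stored per token with the latent
full_per_token = 2 * N_HEADS * D_HEAD    # numbers stored per token with full K and V
print(cached_per_token, "cached values per token vs", full_per_token, "for full KV")
```

In this toy setup the cache holds 8 numbers per token instead of 128; the real savings depend on the actual model dimensions.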

Beyond MLA, DeepSeek highlights other valuable techniques for KV cache size reduction, providing inspiration for future advancements in memory-efficient attention mechanisms:

Shared KV (GQA; MQA): Multiple attention heads share a single set of key-value pairs, drastically compressing storage.

Window KV: Limiting the context window for KV caching.

Quantization Compression: Reducing the precision of stored KV values.

Table 1 in the paper compares the per-token KV cache memory footprint of DeepSeek-V3, Qwen-2.5 72B, and LLaMA-3.1 405B. DeepSeek-V3 achieves a remarkable reduction, requiring only 70 KB per token, significantly lower than LLaMA-3.1 405B’s 516 KB and Qwen-2.5 72B’s 327 KB.
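The figures in that comparison can be reproduced with simple arithmetic. The sketch below assumes BF16 storage (2 bytes per value) and uses layer counts, KV-head counts, and head dimensions taken from the models' publicly released configurations rather than from this article, so treat them as assumptions; the function names are hypothetical.

```python
BYTES_BF16 = 2  # assumption: cache entries stored in BF16

def gqa_kv_bytes_per_token(num_layers, num_kv_heads, head_dim):
    """Grouped-query attention caches one key and one value per KV head per layer."""
    return 2 * num_kv_heads * head_dim * num_layers * BYTES_BF16

def mla_kv_bytes_per_token(num_layers, latent_dim, rope_dim):
    """MLA caches one compressed latent plus a small positional key per layer."""
    return (latent_dim + rope_dim) * num_layers * BYTES_BF16

# Assumed public configuration values (not stated in the article):
qwen_bytes = gqa_kv_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)
llama_bytes = gqa_kv_bytes_per_token(num_layers=126, num_kv_heads=8, head_dim=128)
v3_bytes = mla_kv_bytes_per_token(num_layers=61, latent_dim=512, rope_dim=64)

for name, size in [("DeepSeek-V3 (MLA)", v3_bytes),
                   ("Qwen-2.5 72B (GQA)", qwen_bytes),
                   ("LLaMA-3.1 405B (GQA)", llama_bytes)]:
    print(f"{name}: {size / 1000:.3f} KB per token")
```

Under these assumptions the results come out to 70.272 KB, 327.680 KB, and 516.096 KB, matching the rounded values quoted above.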

Cost-Effectiveness: DeepSeekMoE for Sparse Computation

For sparse computation, DeepSeek developed DeepSeekMoE, an advanced Mixture-of-Experts (MoE) architecture (Figure 1, bottom right). MoE models offer two key advantages in terms of cost-effectiveness:

Reduced Training Compute: By selectively activating a subset of expert parameters per token, MoE architectures allow a substantial increase in the total number of parameters while keeping computational demands manageable. For instance, DeepSeek-V3 has 671B parameters, nearly three times the 236B of its predecessor V2, yet activates only 37B parameters per token. In contrast, dense models like Qwen2.5-72B and LLaMA-3.1 405B require all parameters to be active during training. Table 2 shows that DeepSeek-V3 achieves comparable or superior performance to these dense models at a lower compute cost per token: around 250 GFLOPs, versus 394 GFLOPs for the 72B dense model and 2448 GFLOPs for the 405B dense model, which is roughly an order of magnitude less than the latter.

Advantages for Personal Use and Local Deployment: The selective activation of parameters in MoE models translates to significantly lower memory and compute requirements during single-request inference. DeepSeek-V2 (236B parameters), for example, activates only 21B parameters during inference, enabling close to 20 tokens per second (TPS) or more on personal computers equipped with AI SoCs, a capability far exceeding that of similarly sized dense models on comparable hardware. This opens possibilities for personalized LLM agents running locally.
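To put the sparsity numbers in proportion, the short calculation below uses only the parameter counts and per-token compute figures quoted above; the variable names are illustrative.

```python
def active_fraction(active_billion, total_billion):
    """Share of a model's parameters that are activated for each token."""
    return active_billion / total_billion

v3_fraction = active_fraction(37, 671)   # DeepSeek-V3
v2_fraction = active_fraction(21, 236)   # DeepSeek-V2

# Per-token training compute in GFLOPs, as quoted from Table 2 of the paper.
gflops_per_token = {
    "DeepSeek-V3 (MoE)": 250,
    "Qwen2.5-72B (dense)": 394,
    "LLaMA-3.1 405B (dense)": 2448,
}
baseline = gflops_per_token["DeepSeek-V3 (MoE)"]
cost_vs_v3 = {name: g / baseline for name, g in gflops_per_token.items()}

print(f"V3 activates {v3_fraction:.1%} of its parameters per token")
print(f"V2 activates {v2_fraction:.1%} of its parameters per token")
for name, ratio in cost_vs_v3.items():
    print(f"{name}: {ratio:.2f}x the per-token compute of DeepSeek-V3")
```

The 405B dense model comes out at roughly 9.8 times the per-token compute of DeepSeek-V3, while the 72B dense model is about 1.6 times.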

Enhanced Inference Speed: Overlapping Computation and Communication

DeepSeek prioritizes both system-level maximum throughput and single-request latency for inference speed. To maximize throughput, the model employs a dual micro-batch overlapping architecture from the outset, intentionally overlapping communication latency with computation.

Furthermore, DeepSeek decouples the computation of MLA and MoE into distinct stages. While one micro-batch performs part of the MLA or MoE computation, the other concurrently executes the corresponding dispatch communication. Conversely, during the second micro-batch’s computation phase, the first micro-batch undertakes the combine communication step. This pipelined approach overlaps all-to-all communication with ongoing computation, so the GPUs remain fully utilized. In production, DeepSeek uses a prefill and decode disaggregation architecture, assigning large-batch prefill and latency-sensitive decode requests to expert-parallel groups of different sizes, which maximizes system throughput under real-world serving conditions.
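The alternation described above can be visualized with a toy timeline. This sketch assumes every phase takes one equal-length time slot and that the second micro-batch simply runs one slot behind the first; the phase names and the function `dual_microbatch_schedule` are hypothetical, and the real system's timing is more involved.

```python
# Each layer is modeled as four phases, alternating compute and communication.
PHASES = ["MLA compute", "dispatch comm", "MoE compute", "combine comm"]

def dual_microbatch_schedule(num_layers):
    """Return (micro-batch A phase, micro-batch B phase) for every time slot.

    Micro-batch B is offset by one slot, so whenever A is communicating,
    B is computing, and vice versa.
    """
    total = num_layers * len(PHASES)
    slots = []
    for t in range(total + 1):
        a = PHASES[t % len(PHASES)] if t < total else "idle"
        b = PHASES[(t - 1) % len(PHASES)] if t >= 1 else "idle"
        slots.append((a, b))
    return slots

schedule = dual_microbatch_schedule(num_layers=2)
for t, (a, b) in enumerate(schedule):
    print(f"slot {t}: A = {a:<14} B = {b}")
```

Apart from the first and last slots, every slot pairs one compute phase with one communication phase, which is the overlap the text describes.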

The paper also touches upon the importance of test-time scaling for reasoning models and highlights the critical role of high token output speed in reinforcement learning workflows and for reducing user-perceived latency in long inference sequences. Optimizing inference speed through hardware-software co-innovation is therefore paramount for the efficiency of reasoning models.

Low-Precision Driven Design: FP8 Training and LogFMT

FP8 Mixed-Precision Training

While quantization techniques like GPTQ and AWQ have significantly reduced memory requirements, they have been applied mainly to inference. DeepSeek instead adopted FP8 mixed-precision training for a large-scale MoE model. Although NVIDIA’s Transformer Engine already supported FP8, the authors state that, to their knowledge, no open-source large model had used FP8 for training before DeepSeek-V3. This achievement, resulting from close collaboration between infrastructure and algorithm teams along with extensive experimentation, significantly reduces computational costs while maintaining model quality, making large-scale training more feasible. Figure 1 illustrates the FP8 precision used in the forward and backward passes during training.

LogFMT for Efficient Communication

DeepSeek also employs low-precision compression for network communication within the DeepSeek-V3 architecture. In expert parallelism (EP), tokens are dispatched using fine-grained FP8 quantization, which reduces communication volume by 50% compared to BF16 and significantly shortens communication time.
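The 50% figure follows from element width: BF16 uses 2 bytes per value and FP8 uses 1. The sketch below also estimates the small overhead of per-tile scaling factors. The hidden size of 7168, the tile of 128 values, and the 4-byte scale per tile are assumptions for illustration and are not taken from this article.

```python
HIDDEN_SIZE = 7168   # assumed per-token activation width
TILE = 128           # assumed number of values sharing one scaling factor
BYTES_BF16 = 2
BYTES_FP8 = 1
BYTES_SCALE = 4      # assumed FP32 scaling factor per tile

bf16_bytes = HIDDEN_SIZE * BYTES_BF16
fp8_payload_bytes = HIDDEN_SIZE * BYTES_FP8
fp8_scale_bytes = (HIDDEN_SIZE // TILE) * BYTES_SCALE
fp8_total_bytes = fp8_payload_bytes + fp8_scale_bytes

payload_ratio = fp8_payload_bytes / bf16_bytes   # payload alone
total_ratio = fp8_total_bytes / bf16_bytes       # payload plus scaling factors

print(f"BF16 dispatch: {bf16_bytes} bytes per token")
print(f"FP8 dispatch:  {fp8_total_bytes} bytes per token "
      f"({fp8_payload_bytes} payload + {fp8_scale_bytes} scales)")
print(f"payload ratio: {payload_ratio:.2f}, with scales: {total_ratio:.4f}")
```

Under these assumptions the scaling factors add only about 1.6 percentage points, so the transfer stays close to half the BF16 volume.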

Beyond traditional floating-point formats, DeepSeek experimented with a novel data type called LogFMT-nBit (Logarithmic Floating-Point Formats).

Interconnect-Driven Design: Addressing Hardware Limitations

Current Hardware Architecture and its Constraints

DeepSeek currently uses the NVIDIA H800 GPU in its SXM form (Figure 2). The H800 is based on the same Hopper architecture as the H100, but has reduced FP64 compute performance and NVLink bandwidth (400 GB/s, down from 900 GB/s on the H100) due to regulatory requirements. This significant reduction in intra-node scale-up bandwidth poses challenges for high-performance workloads. To compensate, each node is equipped with eight 400 Gbps InfiniBand (IB) CX7 network interface cards (NICs) to enhance inter-node scale-out capabilities.
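Because NVLink bandwidth is quoted in gigabytes per second and NIC speed in gigabits per second, a quick unit conversion helps when comparing them. The sketch below uses only the figures quoted above and ignores protocol overhead; the helper name is illustrative.

```python
def gbps_to_gbytes_per_s(gigabits_per_second):
    """Convert a link rate in Gb/s to GB/s (8 bits per byte, no protocol overhead)."""
    return gigabits_per_second / 8

NICS_PER_NODE = 8
nic_gbytes = gbps_to_gbytes_per_s(400)        # one 400 Gbps InfiniBand NIC
node_ib_gbytes = NICS_PER_NODE * nic_gbytes   # raw aggregate across a node's NICs

h100_nvlink, h800_nvlink = 900, 400           # GB/s, as quoted above
nvlink_retained = h800_nvlink / h100_nvlink

print(f"one NIC: {nic_gbytes:.0f} GB/s raw, node aggregate: {node_ib_gbytes:.0f} GB/s raw")
print(f"H800 keeps {nvlink_retained:.0%} of the H100's NVLink bandwidth")
```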

Hardware-Aware Parallelization and Model Co-design

To navigate the limitations of the H800 architecture, the DeepSeek-V3 model incorporates hardware-aware design considerations for parallelization, including: avoiding Tensor Parallelism (TP), enhancing Pipeline Parallelism (PP), and accelerating Expert Parallelism (EP). Specific details of these strategies are available in the original paper.

A key aspect of model co-design is “node-aware routing” for the TopK expert selection strategy in the MoE architecture. Given the approximately 4:1 bandwidth difference between intra-node (NVLink, ~160 GB/s effective) and inter-node (IB, ~40 GB/s effective per NIC) communication, DeepSeek designed the routing to leverage the higher intra-node bandwidth. The 256 routed experts (4 per GPU in an 8-node, 64-GPU setup) are arranged into 8 groups of 32 experts, with each group residing on a single node, and the algorithm ensures that each token is routed to at most 4 nodes. This mitigates the IB communication bottleneck and improves effective communication bandwidth during training. Tokens destined for several experts on the same node can be sent via IB once and then forwarded via NVLink, reducing redundant IB traffic.
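A node-limited selection of this kind can be sketched as a two-step pick: first choose the nodes, then choose experts only within them. The version below is a simplified illustration rather than DeepSeek's routing code. The choice of 8 experts per token, the node-ranking heuristic (sum of each node's two highest scores), and the random scores are all assumptions.

```python
import random

NUM_EXPERTS = 256
NUM_NODES = 8
EXPERTS_PER_NODE = NUM_EXPERTS // NUM_NODES   # 32 experts hosted on each node
MAX_NODES_PER_TOKEN = 4
TOP_K = 8   # assumption: number of routed experts selected per token

def node_limited_topk(scores):
    """Pick TOP_K experts for one token while touching at most MAX_NODES_PER_TOKEN nodes."""
    # Step 1: rank nodes by the sum of their two best expert scores (assumed heuristic).
    node_scores = []
    for node in range(NUM_NODES):
        group = scores[node * EXPERTS_PER_NODE:(node + 1) * EXPERTS_PER_NODE]
        node_scores.append(sum(sorted(group, reverse=True)[:2]))
    top_nodes = sorted(range(NUM_NODES), key=lambda n: node_scores[n],
                       reverse=True)[:MAX_NODES_PER_TOKEN]

    # Step 2: take the best experts, restricted to the selected nodes.
    candidates = [e for n in top_nodes
                  for e in range(n * EXPERTS_PER_NODE, (n + 1) * EXPERTS_PER_NODE)]
    chosen = sorted(candidates, key=lambda e: scores[e], reverse=True)[:TOP_K]
    return sorted(chosen)

random.seed(0)
token_scores = [random.random() for _ in range(NUM_EXPERTS)]  # stand-in for gate outputs
experts = node_limited_topk(token_scores)
nodes_used = {e // EXPERTS_PER_NODE for e in experts}
print("experts:", experts)
print("nodes used:", sorted(nodes_used))
```

Because each selected node receives the token over IB only once, capping the number of nodes caps the inter-node traffic regardless of how many experts on those nodes are chosen.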

Scale-Up and Scale-Out Convergence: Future Hardware Directions

While node-aware routing reduces bandwidth demands, the bandwidth disparity between NVLink and IB complicates the implementation of communication-intensive kernels. Currently, GPU Streaming Multiprocessors (SMs) handle both network message processing and data forwarding via NVLink, consuming significant compute resources. DeepSeek advocates for integrating intra-node (scale-up) and inter-node (scale-out) communication into a unified framework.

Integrating dedicated co-processors for network traffic management and seamless forwarding between NVLink and IB domains could reduce software complexity and maximize bandwidth utilization. Hardware support for dynamic traffic deduplication could further optimize strategies like DeepSeek-V3’s node-aware routing. DeepSeek also discusses emerging interconnect efforts such as Ultra Ethernet, from the Ultra Ethernet Consortium (UEC), and Ultra Accelerator Link (UALink), noting the Unified Bus (UB) as a recent approach to converging scale-up and scale-out. The paper outlines methods for achieving this convergence at the programming framework level, including unified network adapters, dedicated communication co-processors, flexible forwarding and broadcast/reduce mechanisms, and hardware synchronization primitives.

Bandwidth Contention and Latency

Another limitation of current hardware is the lack of flexibility in dynamically allocating bandwidth between different traffic types on NVLink and PCIe. For instance, transferring KV cache data from CPU memory to GPUs during inference can saturate PCIe bandwidth, leading to contention with inter-GPU EP communication via IB, potentially degrading overall performance and causing latency spikes. DeepSeek suggests solutions including dynamic NVLink/PCIe traffic prioritization, I/O chiplet integration, and CPU-GPU interconnect within the scale-up domain.

Large-Scale Network-Driven Design: Multi-Plane Fat-Tree

Network Co-design: Multi-Plane Fat-Tree

For DeepSeek-V3 training, a Multi-Plane Fat-Tree (MPFT) scale-

[truncated for AI cost control]