2026-05-05 23:27 UTCIn-site rewrite3 min readUpdated: 2026-06-27 00:25 UTC

DeepSeek API Introduces Context Caching on Disk, Cutting Prices by an Order of Magnitude

DeepSeek's new disk-based context caching slashes API costs by up to 90% for repeated inputs. Cache hits cost $0.014 per million tokens. The feature works automatically and is especially beneficial for multi-turn conversations, data analysis, and long prompts. First token latency drops from 13s to 500ms for 128K prompts.

SourceDeepSeek News

DeepSeek API introduces Context Caching on Disk, cutting prices by an order of magnitude | DeepSeek API Docs

DeepSeek API Docs

English

中文（中国）

DeepSeek Platform

Quick Start

Your First API Call

Models & Pricing

Token & Token Usage

Rate Limit

Error Codes

Agent Integrations

API Guides

Thinking Mode

Multi-round Conversation

Chat Prefix Completion (Beta)

FIM Completion (Beta)

JSON Output

Tool Calls

Context Caching

Anthropic API

API Reference

News

DeepSeek-V4 Preview Release 2026/04/24

DeepSeek-V3.2 Release 2025/12/01

DeepSeek-V3.2-Exp Release 2025/09/29

DeepSeek V3.1 Update 2025/09/22

DeepSeek V3.1 Release 2025/08/21

DeepSeek-R1-0528 Release 2025/05/28

DeepSeek-V3-0324 Release 2025/03/25

DeepSeek-R1 Release 2025/01/20

DeepSeek APP 2025/01/15

Introducing DeepSeek-V3 2024/12/26

DeepSeek-V2.5-1210 Release 2024/12/10

DeepSeek-R1-Lite Release 2024/11/20

DeepSeek-V2.5 Release 2024/09/05

Context Caching is Available 2024/08/02

New API Features 2024/07/25

Other Resources

FAQ

Change Log

News

Context Caching is Available 2024/08/02

On this page

DeepSeek API introduces Context Caching on Disk, cutting prices by an order of magnitude

In large language model API usage, a significant portion of user inputs tends to be repetitive. For instance, user prompts often include repeated references, and in multi-turn conversations, previous content is frequently re-entered.

To address this, DeepSeek has implemented Context Caching on Disk technology. This innovative approach caches content that is expected to be reused on a distributed disk array. When duplicate inputs are detected, the repeated parts are retrieved from the cache, bypassing the need for recomputation. This not only reduces service latency but also significantly cuts down on overall usage costs.

For cache hits, DeepSeek charges $0.014 per million tokens, slashing API costs by up to 90%1.

Hint 1: The API price has been updated. For details, please refer to Models & Pricing.

How to Use DeepSeek API's Caching Service

The disk caching service is now available for all users, requiring no code or interface changes. The cache service runs automatically, and billing is based on actual cache hits.

Note that only requests with identical prefixes (starting from the 0th token) will be considered duplicates. Partial matches in the middle of the input will not trigger a cache hit.

Here are two classic cache usage scenarios:

Multi-turn conversation: The next turn can hit the context cache generated by the previous turn.

Data analysis: Subsequent requests with the same prefix can hit the context cache.

Beneficial Scenarios for Context Caching on Disk:

Q&A assistants with long preset prompts

Role-play with extensive character settings and multi-turn conversations

Data analysis with recurring queries on the same documents/files

Code analysis and debugging with repeated repository references

Improve model output performance through Few-shot learning.

...

For more detailed instructions, please refer to the guide Use Context Caching.

Monitoring Cache Hits

Two new fields in the API response's usage section help users monitor cache performance:

prompt_cache_hit_tokens：Number of tokens from the input that were served from the cache ($0.014 per million tokens)

prompt_cache_miss_tokens: Number of tokens from the input that were not served from the cache ($0.14 per million tokens)

Reducing Latency

First token latency will be significantly reduced in requests with long, repetitive inputs.

For a 128K prompt with high reference, the first token latency is cut from 13s to just 500ms.

Lowering Costs

Users can save up to 90% on costs with optimization for cache characteristics.

Even without any optimization, historical data shows that users save over 50% on average.

The service has no additional fees beyond the $0.014 per million tokens for cache hits, and storage usage for the cache is free.

Security Concerns

The cache system is designed with robust security strategy.

Each user's cache is isolated and logically invisible to others, ensuring data privacy and security.

Unused cache entries are automatically cleared after a period, ensuring they are not retained or repurposed.

Why DeepSeek Leads with Disk Caching

Based on publicly available information, DeepSeek appears to be the first large language model provider globally to implement extensive disk caching in API services.

This is made possible by the MLA architecture in DeepSeek V2, which enhances model performance while significantly reducing the size of the context KV cache, enabling efficient storage on low-cost disks.

DeepSeek API’s Concurrency and Rate Limits

The DeepSeek API is designed to handle up to 1 trillion tokens per day, with no limits on concurrency or rate, ensuring high-quality service for all users. Feel free to scale up your parallelism.

The cache system uses 64 tokens as a storage unit; content less than 64 tokens will not be cached.

The cache system does not guarantee 100% cache hits.

Unused cache entries are automatically cleared, typically within a few hours to days.

DeepSeek-V2.5: A New Open-Source Model Combining General and Coding Capabilities

DeepSeek API Upgrade

How to Use DeepSeek API's Caching Service

Monitoring Cache Hits

Reducing Latency

Lowering Costs

Security Concerns

Why DeepSeek Leads with Disk Caching

DeepSeek API’s Concurrency and Rate Limits

WeChat Official Account

Community

Discord

Twitter

GitHub