How AI memory systems break at scale
This article analyzes four structural failure modes of AI memory systems at scale: cosine similarity's inability to discriminate within a domain, the decoupling of extraction quality from retrieval precision, session drift compounding noise across turns, and latency figures hiding session-level degradation. It proposes alias-weighted BM25 leveraging lexical priming as a replacement for semantic similarity.
The failure modes are structural, not incidental. Similarity search accumulates noise faster than any model can filter it. Here is exactly what breaks, and how we designed around each failure.
Tenure research · ~12 min read
TL;DR
At small scale, frontier models can filter retrieval noise. At thousands of beliefs, that safety net disappears entirely.
Vector similarity cannot discriminate between beliefs that share a domain but differ in relevance. This is a geometry problem, not a capability problem.
Multi-turn sessions compound the failure: beliefs from off-topic turns contaminate re-entry queries with drift scores of 0.92 to 1.0.
Ingestion latency creates a structural availability gap: beliefs introduced mid-session may not be queryable until the session has ended.
The fix is not a better embedding model. Precision across a 20x range in model scale stays at 0.09. The fix is a different retrieval signal.
The hidden assumption
Memory systems are tested at the wrong scale
Every memory system for LLM agents looks adequate in demos and early sessions. The corpus is small, the frontier model is capable, and the model compensates for imprecise retrieval by reasoning through noise. This works until it does not.
The field has converged on benchmarks that operate at tens to low hundreds of beliefs. At that scale, a system that returns its entire store achieves recall of 1.0 and scores competitively on answer-quality metrics, because a capable model can locate the correct answer in a noisy context window. The precision problem is invisible at the scale where everything is tested, and fully visible at the scale where everything breaks.
Serious persistent memory use reaches thousands of beliefs. Full-corpus retrieval becomes architecturally impossible. The precision problem can no longer be offloaded to inference, and the failure that was invisible in evaluation surfaces immediately in production.
The generative model was never a neutral downstream consumer. It was load-bearing infrastructure compensating for retrieval imprecision. That load-bearing role cannot scale with the store.
Failure mode 1
Cosine similarity cannot discriminate within a domain
In any belief store where the user works within a technical domain, all beliefs about that domain occupy a shared semantic region. A query about Redis is semantically close to the Redis belief you want, and nearly as close to beliefs about MongoDB, TypeScript, Kubernetes, Fastify, and GitHub Actions. Cosine scores across these range from 0.65 to 0.83. The semantic relatedness is genuine, but it measures the wrong thing: shared domain, not relevance to the query.
The predictable response is to reach for a more capable embedding model. We tested three, spanning a 20x range in scale: a 768-dimension model, a 1024-dimension model, and an 8-billion-parameter model producing 4096-dimension embeddings. Mean retrieval precision was 0.09 across all three. The qwen3-8b result is the clearest demonstration that this is not a capability problem: at over 1,100ms mean per query, its precision was identical to that of the smallest model.
| Embedding model | Dimensions | Mean precision | Active retrieval passes | Mean latency |
|---|---|---|---|---|
| nomic-embed-text | 768 | 0.09 | 0 / 48 | 43ms |
| mxbai-embed-large | 1024 | 0.09 | 0 / 48 | 96ms |
| qwen3-8b | 4096 | 0.09 | 0 / 48 | 1,131ms |
Precision is invariant to embedding model scale. All 11 total passes in every configuration are structural or trivially empty cases. Zero active retrieval passes across all three models.
A more powerful embedder distributes scores differently across the corpus but cannot eliminate genuine semantic proximity within a domain-specific corpus. The fix is not a better ruler. It is a different measurement instrument entirely.
Failure mode 2
Extraction quality does not predict retrieval precision
One of the more counterintuitive findings from our evaluation is that faithfully extracted beliefs can still fail at retrieval. The extraction pipeline and the retrieval pipeline are architecturally decoupled, and precision failures occur in the retrieval layer regardless of what the extraction layer did.
Consider a concrete case from PrecisionMemBench. A relation-type belief linking an auth service to a Redis dependency was ingested through Mem0's extraction pipeline. The stored memory preserved every operationally significant fact: the service name, the dependency target, the fail-open behavior, and the coupling assertion. High-quality extraction by any measure.
Stored in Mem0 after extraction
User's auth service depends on Redis for session storage. If Redis goes down, auth fails open by denying all requests. Auth resilience discussions must address Redis availability; the two are tightly coupled.
A query asking for auth service dependencies and failure modes returned this belief correctly, then returned 16 additional beliefs including linting configuration, React expertise levels, a Vitest preference, a communication style preference, and a superseded SQLAlchemy belief. Retrieval precision: 0.056. The structurally required participant belief was absent from the result set entirely despite being referenced in the stored text.
The extraction was not the problem. The retrieval layer contaminated the result set with semantically proximate beliefs that had no relevance to the query. Improving extraction quality cannot fix this.
When the query was slightly less specific, one required belief disappeared from the result set entirely. When it was more specific, both required beliefs appeared alongside 16 irrelevant ones. Neither outcome required poor extraction. The precision floor is structural, not query-dependent.
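The arithmetic behind these figures can be sketched in a few lines. This is an illustration, assuming precision is defined as relevant retrieved beliefs divided by all retrieved beliefs. The belief IDs are hypothetical, and returning 0.0 for an empty result set is a convention chosen here, not necessarily the one PrecisionMemBench uses.

```python
def retrieval_precision(retrieved_ids, relevant_ids):
    """Fraction of retrieved beliefs that are relevant to the query."""
    if not retrieved_ids:
        # Assumed convention for an empty result set.
        return 0.0
    relevant = set(relevant_ids)
    hits = sum(1 for belief_id in retrieved_ids if belief_id in relevant)
    return hits / len(retrieved_ids)


# The more specific query described above: both required beliefs are
# returned, alongside 16 irrelevant ones (IDs are hypothetical).
required = ["auth-redis-relation", "auth-service"]
noise = [f"unrelated-{i}" for i in range(16)]

precision = retrieval_precision(required + noise, required)
print(f"precision = {precision:.3f}")
```

Under this definition, even the case where every required belief is found yields a precision of 2 in 18. Nothing in that calculation depends on how well the beliefs were extracted.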
Failure mode 3
Session drift compounds noise across turns
Single-turn retrieval metrics conceal a failure that only becomes visible across a session. Memory is stateful. Beliefs introduced during one turn occupy the same vector space as beliefs from every other turn, and cosine similarity has no mechanism for respecting the temporal or topical boundaries between them.
Our session-level evaluation runs a 10-turn session: a topic is established at turn 0, followed by 8 drift turns across unrelated domains, followed by an implicit return to the original topic at turn 9. The drift score measures what fraction of retrieved beliefs at re-entry originated from off-topic drift turns. A perfect system scores 0.0. Comparison systems score 0.92 to 1.0.
| System | Turn 9 drift score | Turn 10 drift score | Cross-session drift |
|---|---|---|---|
| Tenure | 0.0 | 0.0 | 0.0 |
| Vector baseline | 1.0 | 0.94 | 0.94 |
| Mem0 | 1.0 | 1.0 | 1.0 |
| Zep | 1.0 | 0.92 | 0.92 |
| Hindsight | 1.0 | 0.94 | 1.0 |
Drift score is the fraction of retrieved non-pinned beliefs originating from off-topic turns at re-entry. 0.0 is perfect isolation. Comparison systems surface noise from unrelated drift turns regardless of re-entry query specificity.
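A minimal sketch of the drift score as defined above. The `RetrievedBelief` record and its field names are assumptions made for illustration, as is the choice to score an empty result set as 0.0. The benchmark's own data model may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrievedBelief:
    belief_id: str
    origin_turn: int      # turn at which the belief was introduced
    pinned: bool = False  # pinned beliefs are excluded from the score


def drift_score(retrieved, drift_turns):
    """Fraction of retrieved non-pinned beliefs that came from drift turns."""
    candidates = [b for b in retrieved if not b.pinned]
    if not candidates:
        # Assumed convention: nothing retrieved means nothing drifted in.
        return 0.0
    off_topic = sum(1 for b in candidates if b.origin_turn in drift_turns)
    return off_topic / len(candidates)


# Turn 0 sets the topic, turns 1 to 8 drift, turn 9 re-enters.
drift_turns = set(range(1, 9))
at_reentry = [
    RetrievedBelief("topic-belief", origin_turn=0),
    RetrievedBelief("drift-a", origin_turn=2),
    RetrievedBelief("drift-b", origin_turn=5),
    RetrievedBelief("drift-c", origin_turn=7),
    RetrievedBelief("always-on", origin_turn=3, pinned=True),
]
score = drift_score(at_reentry, drift_turns)
```

In this hypothetical result set, three of the four non-pinned beliefs come from drift turns, so the score is 0.75. A score of 1.0 means the on-topic belief is not in the result set at all.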
The Hindsight result at turn 10 is worth examining specifically. The cross-encoder reranker bundled in its full image is the architectural feature designed to address exactly this class of problem. At that turn, Hindsight achieves a drift score of 0.94 with the correct belief absent from the result set entirely: not ranked low, but missing. The reranker does not close the gap because the gap is in the cosine geometry the reranker operates on, not in the ranking order.
Failure mode 4
Latency figures hide session degradation
Published latency benchmarks for memory systems almost universally report single-turn figures. Single-turn latency is to session latency as synthetic benchmarks are to production load: a measurement that tells you something useful about a condition that does not exist in practice.
Under session load, retrieval paths that were already imprecise degrade further. One comparison system reports sub-700ms single-turn latency in its published evaluation. Across the 12 session cases in PrecisionMemBench, the same system exceeds 2,700ms mean per session turn, with p95 above 6,000ms.
Single-turn mean: 672ms (Hindsight, published) vs. session-turn mean: 2,736ms (Hindsight, session load)
Ingestion latency creates a separate structural problem. Zep's graph-based write architecture produces read-time latency of 139ms, one of the more competitive single-turn figures among the systems evaluated. It also produces 897 seconds of total ingestion time across a 35-belief corpus, meaning 25,630ms per belief. At a typical conversational turn cadence of 10 to 30 seconds, a belief introduced at turn 1 may not be queryable until the session has largely concluded.
This is not an edge case. A belief is only useful if it is available when needed. A memory system with an availability gap measured in minutes does not solve the re-orientation problem; it defers it.
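The availability gap can be worked through with the figures above. The per-belief cost follows directly from the reported totals. The queueing model is an assumption made purely for illustration: it supposes ingestion is serial and that each turn introduces exactly one belief, neither of which is confirmed for Zep.

```python
TOTAL_INGESTION_S = 897
CORPUS_SIZE = 35

# About 25.63 seconds of ingestion per belief.
per_belief_s = TOTAL_INGESTION_S / CORPUS_SIZE


def queryable_at(turn_index, cadence_s, ingest_s):
    """Time in seconds at which the belief from `turn_index` becomes queryable.

    Assumes a serial ingestion queue and one new belief per turn,
    with turn k starting at k * cadence_s.
    """
    finish = 0.0
    for k in range(turn_index + 1):
        arrival = k * cadence_s
        # Ingestion starts when the belief arrives and the queue is free.
        finish = max(finish, arrival) + ingest_s
    return finish


for cadence_s in (10, 30):
    ready = queryable_at(9, cadence_s, per_belief_s)
    lag_s = ready - 9 * cadence_s
    print(f"cadence {cadence_s}s: turn-9 belief lags by {lag_s:.1f}s")
```

Under these assumptions, at a 10-second cadence the queue never drains, so each later belief waits longer than the one before it. At a 30-second cadence the queue keeps up, but every belief still misses most of the turn that follows it.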
What we built instead
A different retrieval signal from first principles
Each of these failure modes has the same root cause: cosine similarity is the wrong primary retrieval signal for a bounded-vocabulary context where the user coined the terminology. The additional infrastructure layered on top of it (rerankers, temporal trees, hierarchical graphs) compensates for the wrong primary signal rather than replacing it.
The correct signal exploits a property of individual language production. Single speakers maintain stable, distinctive lexical choices across production contexts over periods of one to two years. Lexical priming formalizes the mechanism: words become entrained through use, and speakers reliably return to the same lexical choices in the same topical contexts. A single-user belief store is precisely the setting where these properties are strongest: the query author and the belief author are the same person.
If a user named their Kubernetes belief with canonical name kubernetes and aliases k8s and kube, then a query containing k8s should retrieve that belief with high precision regardless of semantic distance. There is no ambiguity to resolve: the authored terminology is the ground truth. Alias-weighted BM25 retrieves what the user named. In a single-user persistent memory context, that is more often correct than what is semantically nearby.
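To make the mechanism concrete, here is a minimal sketch of alias-weighted BM25. It is an illustration under stated assumptions, not Tenure's implementation. The `Belief` fields, the `alias_weight` of 3.0, the tokenizer, and the BM25 parameters `k1` and `b` are all hypothetical choices. Canonical names and aliases are simply counted as heavier term occurrences than body text.

```python
import math
import re
from dataclasses import dataclass, field


@dataclass
class Belief:
    canonical: str
    text: str
    aliases: list[str] = field(default_factory=list)


def tokenize(s):
    return re.findall(r"[a-z0-9]+", s.lower())


class AliasWeightedBM25:
    def __init__(self, beliefs, alias_weight=3.0, k1=1.2, b=0.75):
        self.beliefs, self.k1, self.b = beliefs, k1, b
        self.tfs = []
        for belief in beliefs:
            tf = {}
            for tok in tokenize(belief.text):
                tf[tok] = tf.get(tok, 0.0) + 1.0
            # Authored names count for more than incidental body text.
            for name in [belief.canonical, *belief.aliases]:
                for tok in tokenize(name):
                    tf[tok] = tf.get(tok, 0.0) + alias_weight
            self.tfs.append(tf)
        self.lengths = [sum(tf.values()) for tf in self.tfs]
        self.avg_len = sum(self.lengths) / len(self.lengths) if beliefs else 0.0
        self.df = {}
        for tf in self.tfs:
            for tok in tf:
                self.df[tok] = self.df.get(tok, 0) + 1

    def _idf(self, tok):
        n, df = len(self.beliefs), self.df.get(tok, 0)
        return math.log(1 + (n - df + 0.5) / (df + 0.5))

    def search(self, query):
        """Return (score, canonical) pairs; beliefs with no lexical match are omitted."""
        results = []
        terms = set(tokenize(query))
        for belief, tf, length in zip(self.beliefs, self.tfs, self.lengths):
            score = 0.0
            for tok in terms:
                f = tf.get(tok, 0.0)
                if f == 0.0:
                    continue
                norm = f + self.k1 * (1 - self.b + self.b * length / self.avg_len)
                score += self._idf(tok) * f * (self.k1 + 1) / norm
            if score > 0.0:
                results.append((score, belief.canonical))
        return sorted(results, reverse=True)


index = AliasWeightedBM25([
    Belief("kubernetes", "Production services are deployed to a managed cluster.", ["k8s", "kube"]),
    Belief("redis", "Session storage for the auth service."),
    Belief("mongodb", "Primary document store for the catalog.", ["mongo"]),
])
hits = index.search("how is k8s set up")
misses = index.search("quarterly budget")
```

The query containing `k8s` matches only the belief that carries that alias. The second query shares no vocabulary with the store, so it returns nothing instead of the nearest neighbors.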
Noise accumulation
Hard scope isolation
Scope is a hard filter, not a ranking signal. A superseded or out-of-scope belief is never a candidate, regardless of match quality. Session drift is structurally impossible.
Vocabulary coverage
Alias enrichment flywheel
Every session is an observation of how the user refers to beliefs in natural language. New surface forms are captured and added to the alias set continuously. Precision improves with use.
Stale context
Supersession chain
Superseded beliefs are retained for audit but never injected. The system can distinguish "we never had this belief" from "we moved past it." Stale context is structurally retired, not probabilistically suppressed.
Noise floor growth
Compaction
The belief store grows monotonically without compaction. Compaction prevents noise floor accumulation over time by merging duplicate and overlapping beliefs while preserving the full alias history of each merged entry.
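Two of the mechanisms above, hard scope isolation and the supersession chain, can be sketched as a filter that runs before any scoring. This is a minimal illustration. The `ScopedBelief` record, the `superseded_by` pointer, and the example IDs are hypothetical, not Tenure's schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScopedBelief:
    belief_id: str
    scope: str
    text: str
    superseded_by: Optional[str] = None  # ID of the belief that replaced this one


def candidates(store, active_scope):
    """Hard filter: only in-scope, non-superseded beliefs are ever scored."""
    return [
        b for b in store
        if b.scope == active_scope and b.superseded_by is None
    ]


def supersession_chain(store, belief_id):
    """Follow replacements from a belief to its current head, for audit."""
    by_id = {b.belief_id: b for b in store}
    chain, seen = [], set()
    current = by_id.get(belief_id)
    while current is not None and current.belief_id not in seen:
        chain.append(current.belief_id)
        seen.add(current.belief_id)  # guards against accidental cycles
        nxt = current.superseded_by
        current = by_id.get(nxt) if nxt is not None else None
    return chain


store = [
    ScopedBelief("orm-v1", "backend", "Old ORM choice.", superseded_by="orm-v2"),
    ScopedBelief("orm-v2", "backend", "Current ORM choice."),
    ScopedBelief("lint-rules", "frontend", "Linting configuration."),
]
active = [b.belief_id for b in candidates(store, "backend")]
audit = supersession_chain(store, "orm-v1")
```

Because the filter runs before ranking, a superseded or out-of-scope belief cannot enter the result set no matter how well it matches. The chain stays available for audit, which is how "never held" can be told apart from "moved past".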
The predictable objection to BM25 is vocabulary coverage: if a user refers to a belief using a term not yet in the alias set, retrieval fails. This objection is correct as a static description and wrong as a practical one. On first encounter, the system returns silence rather than noise. The extraction worker captures the new term as an alias. Every subsequent query using that term resolves correctly.
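The first-encounter behavior can be sketched as an alias index that returns nothing for an unknown term and resolves it once the term has been recorded. This is a minimal illustration. The `AliasIndex` class and its `observe` method are hypothetical stand-ins for whatever the extraction worker actually does.

```python
from typing import Optional


class AliasIndex:
    def __init__(self):
        self._alias_to_belief = {}

    def observe(self, surface_form: str, canonical: str) -> None:
        """Record a surface form the user was seen using for a belief."""
        self._alias_to_belief[surface_form.lower()] = canonical

    def resolve(self, term: str) -> Optional[str]:
        """Return the canonical belief name, or None if the term is unknown."""
        return self._alias_to_belief.get(term.lower())


aliases = AliasIndex()
aliases.observe("kubernetes", "kubernetes")
aliases.observe("k8s", "kubernetes")

# First encounter with a new surface form: silence, not a nearest neighbor.
first_lookup = aliases.resolve("kube")

# Hypothetical extraction step ties the new form to the existing belief.
aliases.observe("kube", "kubernetes")
second_lookup = aliases.resolve("kube")
```

The first lookup returns `None` and the second returns `kubernetes`. The cost of an unseen term is a single miss, after which the term resolves on every later query.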
The consequence is a precision flywheel that runs in the opposite direction from similarity search. A purely semantic system degrades as the store grows: more beliefs means more semantic mass, broader cosine overlap, and lower precision on every query. Alias-weighted BM25 improves as the store grows: more sessions means more observed surface forms, a richer alias set, and higher precision on the vocabulary that is actually
[truncated for AI cost control]