Lessons from Shipping Persistent Memory for AI Agents
The journey of building mem9, an agent memory product, revealed that memory is a complex engineering challenge beyond simple storage, requiring precision, user visibility, and continuous evaluation. Starting from a customer request, the team rapidly prototyped and iterated, learning that an API alone is insufficient, and that memory must feel human and extend beyond text to multimodal experiences.
Key Takeaways
mem9 started as a customer request in March 2026, not a roadmap. We shipped a prototype before we wrote a plan.
Agent memory is not a storage problem. It is an engineering problem at the intersection of ingestion, ranking, evaluation, and product judgment.
A memory API alone is not a product. People want to see, inspect, trust, and correct what an agent remembers.
mem9 runs on TiDB Cloud, the same substrate behind TiDB Cloud Zero.
In early March 2026, a customer asked us for something that sounded simple and turned out to be one of the hardest problems in the agent stack: Make agents remember.
We did not start with a polished roadmap, a heavyweight architecture review, or a six-month product plan. We started the way many products do: With a concrete user pain, a rough prototype, and a very short distance between “this is interesting” and “we need to ship this.”
That was the beginning of mem9.
Looking back now, mem9 feels less like a conventional software project and more like a compressed startup year. What began as a fast customer-driven prototype quickly became a product, then a platform, and then a much deeper exploration of what agent memory actually requires in production. The visible features changed quickly, but the core question stayed the same. How do you help an agent remember what matters, without overwhelming it with everything else?
This is what we have learned so far.
It Started With a Real Problem, Not a Market Thesis
The real beginning of mem9 was not a category map or a strategy deck. It was a customer asking a practical question. If an agent could keep durable memory across sessions, would the user actually feel the difference?
We believed the answer might be yes, but belief was not enough. We needed to make the value obvious, fast.
So we took the shortest path to proof. We built a rough but convincing version, put it in front of a customer, and watched for the reaction. That prototype did exactly what it needed to do. It made the value legible. Once people could see an agent remember something it would normally forget, the conversation changed immediately. We were no longer talking about an interesting capability. We were talking about a product the market was ready for.
That early moment shaped everything that followed. mem9 has always felt like an agent-era product to us because it was born from workflow pain rather than abstract positioning. It was validated almost immediately, and once it was validated, the pace changed. The project stopped behaving like an experiment and started behaving like a startup.
In the first few days, we assembled the core of the system surprisingly quickly: A Go server, memory APIs, TiDB Cloud for storage, search, auth, rate limiting, and the first plugin integrations. Almost immediately after that, support expanded across agent environments such as OpenClaw, OpenCode, and Claude Code, while onboarding improved, multi-tenant foundations landed, and the first mem9.ai site went live. We were not following a neat sequence from infra to product to growth. In reality, all of those tracks were moving at once, because once the value was obvious, hesitation became more expensive than momentum.
Memory Is Not a Storage Feature
Early on, one thing became clear: We were not trying to build “a vector database for agents.” We were trying to build memory that actually improves agent behavior.
That is a small change in framing with very large architectural consequences.
A lot of discussions about agent memory still frame the problem as storage plus retrieval. In practice, that framing is too shallow. The hard part is not whether information can be stored. The hard part is whether the right information comes back at the right time, in the right amount, under real production constraints.
Too little recall and the agent forgets the one detail that matters. Too much recall and the context gets polluted with irrelevant baggage. If recall becomes noisy as the memory corpus grows, trust disappears. So the challenge is not persistence by itself. The challenge is precision.
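To make the trade-off concrete, here is a minimal sketch of one way to bound recall on both sides: a relevance floor to keep noise out, and a token budget to keep context small. It is an illustration only, not mem9's actual logic. The `Memory` type, the `select_memories` function, the 0.5 score floor, and the word-count token estimate are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Memory:
    text: str
    score: float  # hypothetical relevance score from some ranker, 0..1


def estimate_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: one token per whitespace-separated word.
    return len(text.split())


def select_memories(candidates, min_score=0.5, token_budget=50):
    """Keep the highest-scoring memories that clear min_score and fit the budget."""
    selected = []
    used = 0
    for mem in sorted(candidates, key=lambda m: m.score, reverse=True):
        if mem.score < min_score:
            break  # everything after this scores lower, so treat it as noise
        cost = estimate_tokens(mem.text)
        if used + cost > token_budget:
            continue  # this one would overflow; a shorter memory may still fit
        selected.append(mem)
        used += cost
    return selected
```

Raising `min_score` trades recall for precision, while lowering `token_budget` caps how much context a single query can consume. In a real system both would likely be tuned against evaluation data rather than fixed by hand.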
That insight pushed mem9 very quickly beyond a basic memory store. What started as durable memory soon became a more opinionated system for ingestion, extraction, reconciliation, ranking, and retrieval. We moved toward a server-centric architecture because we wanted integrations to stay thin while the memory logic could evolve centrally. That decision mattered. It let us improve behavior at the core instead of pushing complexity into every plugin or runtime.
This is the part of the category that we think is still underestimated. Memory quality is not mainly a UI problem, and it is not purely a model problem either. It is an engineering problem that sits at the intersection of storage, ranking, evaluation, latency, product judgment, and orchestration. If agents are going to do meaningful work in production, they do not just need more context. They need better context.
An API Is Not an Agent Memory Product
The next lesson came quickly. A memory API alone is not a product.
People do not just want memory to exist. They want to see it, inspect it, trust it, correct it, and eventually shape it. That is what pushed mem9 beyond infrastructure.
The next phase of mem9 was about turning an invisible backend capability into something users could actually experience. We built surfaces that made memory legible: Session views, timeline views, analysis workflows, filters, previews, and insight layers that helped people understand not only what had been remembered, but also why it mattered. That work gradually became “Your Memory,” not just as a UI, but as a way to make long-term memory feel concrete instead of abstract.
On the backend, that shift demanded a different kind of engineering. The work moved toward taxonomy, analysis quality, deduplication, responsiveness, and better report workflows. None of that had the drama of the first sprint, but it was just as important. The first phase proved that memory could work. This phase made it understandable and trustworthy.
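Deduplication is one of the more mechanical items on that list, and a small sketch shows the general shape of the problem. The example below drops near-duplicate memories using Jaccard similarity over word sets. This is a simple stand-in technique chosen for illustration, and the source does not say how mem9 deduplicates. The function names and the 0.8 threshold are assumptions.

```python
import re


def tokens(text):
    # Lowercase word set, ignoring punctuation.
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Word-set overlap between two strings, from 0.0 (disjoint) to 1.0 (identical)."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def deduplicate(memories, threshold=0.8):
    """Keep the first occurrence of each memory; drop later near-duplicates."""
    kept = []
    for text in memories:
        if all(jaccard(text, existing) < threshold for existing in kept):
            kept.append(text)
    return kept
```

This pairwise comparison is quadratic in the number of memories, so it only suits small batches. A production system would more plausibly compare embeddings or use an index.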
At the same time, we were building all the less glamorous pieces that turn curiosity into adoption: The public website, docs, analytics, attribution, contact flows, better onboarding, and eventually API documentation. None of those changes are especially cinematic in a commit log, but they are how real products grow. Growth rarely comes from one dramatic launch. More often, it comes from dozens of small improvements that reduce friction, make the value easier to grasp, and help interested users become active users.
That combination of technical depth and product polish mattered. mem9 moved from a fast prototype to a product people could discover, evaluate, and use seriously. Within a little over two weeks, it had already crossed 10,000 users.
We Made Evaluation Part of the Agent Memory Product
Once users start relying on memory in real workflows, intuition is no longer enough.
“It feels better” is a good starting point, but it is not an operating system. We needed ways to measure whether recall quality was improving, regressing, or simply changing shape. That is what pulled us deeper into benchmarks.
Instead of treating benchmarks as side research, we treated them as product infrastructure. We built evaluation harnesses, adapted older multi-turn datasets into more modern agent settings, and created feedback loops that could guide actual engineering decisions. The point was not to chase performance benchmarks. The point was to make memory quality visible and debuggable.
That distinction mattered even more as mem9 entered more demanding conversations and partnerships, especially around Kimi. Once your system is being evaluated as a serious long-term memory layer for real agent workflows, vague claims stop being useful. You need baselines and evidence to understand where retrieval works, where it fails, and how changes affect precision, recall, duplication, and evidence quality.
In that sense, benchmarking became less like academic scoring and more like instrumentation for product truth. It helped us move beyond taste and into iteration. It gave us a way to turn “memory feels off” into something diagnosable and improvable.
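As a hedged illustration of what "diagnosable" can mean, the sketch below scores a single query against a hand-labeled set of relevant memories and reports precision, recall, and duplication. The function name, the ID format, and the exact metric definitions are assumptions for this example, not a description of mem9's harness.

```python
def evaluate_recall(retrieved, relevant):
    """Score one query: retrieved is an ordered list of memory IDs, relevant is a set."""
    unique = set(retrieved)
    hits = unique & relevant
    precision = len(hits) / len(unique) if unique else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    # Share of returned slots spent on repeats of an already-returned memory.
    duplication = 1 - len(unique) / len(retrieved) if retrieved else 0.0
    return {"precision": precision, "recall": recall, "duplication": duplication}
```

Run over a fixed query set before and after a change, numbers like these make it possible to tell whether a ranking tweak improved recall, regressed it, or only moved errors around.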
Agent Memory Has to Feel Human
One of the more interesting lessons in building mem9 was that memory should not remain purely invisible.
The APIs matter. The storage model matters. The ranking logic matters. But users do not experience memory as an index. They experience it as continuity. They care about whether the system feels like it knows them, whether it can reconnect threads over time, and whether that continuity feels trustworthy rather than uncanny.
That is part of why we kept investing in visualization and memory management instead of stopping at an API layer. It is also why some of the most distinctive ideas in mem9 came from product intuition rather than architecture diagrams.
A good example is Memory Farm, our visual memory explorer. On the surface, it looks playful: A pixel-art-inspired interface where memories grow as plants in a garden, clustered by topic and connected by relationship. The underlying instinct is serious. Memory becomes easier to understand when users can see patterns, clusters, history, and relationships in more intuitive forms. If memory is central to how an agent relates to a user, then memory products should not feel cold by default.
That lesson shaped more of the product than we expected. The goal was never just to retrieve facts. It was to help people build trust in a system that remembers on their behalf.
The Category Is Crowded Because the Problem Is Real
From the outside, agent memory can look like a hot category. From the inside, it mostly looks like a long list of hard edge cases.
Large context windows are still finite. Important facts get buried under recent noise. Naive retrieval brings back the wrong things. Repetition wastes tokens. Quality degrades as memory grows. And once recall starts to feel random, users lose confidence very quickly.
mem9 was built inside those problems from day one. That is why the product moved so quickly from raw persistence into ingestion, reconciliation, hybrid retrieval, ranking, analysis, benchmarking, and orchestration. The market attention is real, but it is downstream of a very real product need. Everyone building serious agents runs into the same failures sooner or later.
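Hybrid retrieval usually means merging results from more than one search method, for example keyword search and vector search. One common way to merge ranked lists is reciprocal rank fusion (RRF), sketched below. The source does not say which fusion method mem9 uses, so treat this as an example of the general idea. The function name is hypothetical, and `k=60` is a conventional default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of IDs into one list, best first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes more for items it ranks nearer the top.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by ID so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))
```

An item that appears near the top of both lists outranks one that appears in only one list, which is the behavior you want when keyword and semantic signals agree.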
That is also why ecosystem shifts mattered so much to us. As agent frameworks introduced better lifecycle control around how context is assembled, memory stopped looking like a sidecar and started looking like a core part of the context pipeline. That is the point where the category becomes much more interesting. The best memory system is not the one that stores the most. It is the one that helps an agent decide what should stay, what should surface, and what should remain quiet.
Agent Memory Should Not Stop at Text
As we built mem9, we became increasingly convinced that long-term memory for agents should eventually become much warmer and richer than text-only retrieval.
This became especially vivid in conversations around multimodal use cases. Once you move beyond coding agents and into products built around voice, photos, and video, the meaning of memory changes. A useful memory system should not just retrieve a sentence from years ago. It should be able to retrieve the image, the audio fragment, the interaction, the evidence, and the surrounding context that make the present moment more meaningful.
That direction has shaped a lot of our thinking, especially alongside drive9, our new companion product for files and artifacts. If an agent can accurately bring back not only words but also sounds, images, and other forms of stored experience, memory stops feeling like note-taking and starts feeling much closer to continuity.
That is still an unfolding part of the journey,