What happens when OKF runs inside an AI tool
Google's Open Knowledge Format (OKF) is a markdown standard for AI knowledge. This article tests the runtime pattern that results when an OKF bundle is placed inside an AI tool and the model must decide which files to inspect. OKF did much better than raw semantic memory retrieval, but performance dropped sharply in a multi-turn session (2 of 12 turns passed), with failures concentrated in scope disambiguation, supersession, and session drift. The main takeaway: OKF solves interchange, not selection.
Google released the Open Knowledge Format as a simple markdown standard for AI knowledge. We wanted to test the part that matters in practice: what happens when that bundle is placed inside an AI tool and the model has to decide which files to inspect.
Tenure research · ~7 min read
TL;DR
This was not a test of markdown as a storage format.
It was a test of the runtime pattern implied by OKF when a bundle is dropped into an AI tool today.
The model received normal tool access to list files and read files. No custom retrieval-agent system prompt.
PrecisionMemBench scored the belief IDs corresponding to the files the model actually read.
OKF did much better than raw semantic memory retrieval, but it still showed the same core issue: files are a format, not a retrieval policy.
Why test it
I like OKF. I just wanted to see what happens when an AI tool actually has to use it.
I like the direction of the Open Knowledge Format. A directory of markdown files with YAML frontmatter is boring in the best way. It is easy to read, easy to diff, easy to commit, and easy for humans to maintain. That is a real advantage over proprietary memory blobs or hidden vendor indexes.
But after reading the spec, the obvious question was not whether markdown is a good interchange format. Markdown is fine. The question was what happens at runtime.
Because the moment an OKF bundle gets placed into an AI tool, somebody still has to decide which files enter the model request. Maybe the model lists the files. Maybe it reads one. Maybe it reads several. Maybe it never opens the file that actually contains the needed belief. The format makes the knowledge portable. It does not, by itself, make retrieval precise.
The thing we tested was not Open Knowledge Format as a storage layer. We tested the access pattern most teams would get if they dropped an OKF bundle into a tool today and let the model inspect it.
The setup
How we ran the OKF bundle
The wrapper was intentionally plain. The model got a normal assistant prompt and two tools: one to list available OKF markdown files, and one to read a specific file. The model was not given a custom retrieval-agent prompt telling it how to behave. It just received a user query and decided whether to inspect the bundle.
When the model called read_file, the wrapper parsed the file's frontmatter, extracted the beliefId, and returned that ID to PrecisionMemBench as a retrieved belief. That is the whole bridge. Files read by the model became retrieved belief IDs. Files not read did not count.
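The bridge described above can be sketched in a few lines. This is a minimal illustration, not the wrapper we ran: the class name OkfBridge, the method names, and the flat `key: value` frontmatter parser are assumptions made for the example, and a real wrapper would use a proper YAML parser.

```python
from pathlib import Path


def parse_frontmatter(text: str) -> dict:
    """Parse flat `key: value` pairs from a leading '---' block.

    Simplification for this sketch: this is not a full YAML parser.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return fields
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip().strip("\"'")
    return {}  # no closing delimiter: treat the file as having no frontmatter


class OkfBridge:
    """Hypothetical wrapper exposing the two tools and recording what was read."""

    def __init__(self, bundle_dir):
        self.bundle_dir = Path(bundle_dir)
        self.retrieved_ids = []  # belief IDs of files the model actually read

    def list_files(self):
        # Tool 1: list the markdown files in the bundle.
        return sorted(p.name for p in self.bundle_dir.glob("*.md"))

    def read_file(self, name):
        # Tool 2: return the file body and record its beliefId as retrieved.
        text = (self.bundle_dir / name).read_text(encoding="utf-8")
        belief_id = parse_frontmatter(text).get("beliefId")
        if belief_id and belief_id not in self.retrieved_ids:
            self.retrieved_ids.append(belief_id)
        return text
```

After a query finishes, retrieved_ids is what would be handed to the benchmark. A file the model never opens never appears in that list.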
That matters because PrecisionMemBench does not score whether the final answer sounds good. It scores whether the system retrieved the right underlying beliefs. In this run, the question was simple: did model-directed file access pull the right OKF documents into the request?
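To make the scoring idea concrete, here is a rough sketch of set-based precision and recall over belief IDs. It uses the standard definitions and is an assumption about the general shape of the metric, not PrecisionMemBench's actual implementation. The function name and the handling of empty sets are mine.

```python
def precision_recall(retrieved, expected):
    """Set-based precision and recall over belief IDs.

    Assumed conventions for empty sets (not taken from the benchmark):
    retrieving nothing when nothing was expected scores 1.0 on both.
    """
    retrieved, expected = set(retrieved), set(expected)
    hits = len(retrieved & expected)
    if retrieved:
        precision = hits / len(retrieved)
    else:
        precision = 1.0 if not expected else 0.0
    recall = hits / len(expected) if expected else 1.0
    return precision, recall


# The model read four files, but only one of the two expected beliefs was among them.
print(precision_recall(["b-1", "b-2", "b-3", "b-4"], ["b-1", "b-9"]))  # (0.25, 0.5)
```

Under these definitions, a model that opens many files to be safe can keep recall high while precision falls, because every extra file read counts as a retrieved belief.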
Single-turn run
Mean precision: 0.47. 77 cases, 36 passes, 18 active retrieval passes, 0.91 mean recall, and 4.4s mean latency.
Session run
Pass rate: 0.17. 12 session turns, 2 passes, 1 active retrieval pass, 0.45 mean recall, and 59.3s p95 latency.
The encouraging part
OKF improved the shape of the problem.
The good news is that OKF did not behave like raw vector recall. The model had filenames, titles, descriptions, types, tags, and readable file bodies. That gave it more handles than a cosine search over opaque memory chunks.
Alias resolution was the clearest win. Across 23 alias cases, the run reached 0.72 mean precision and 0.92 mean recall. Some short-form queries worked exactly how you would hope. A query for GHA could lead the model to the GitHub Actions belief. A query for Mongo could lead it to the MongoDB decision. In those cases, the filesystem pattern gives the model a real path to the right document.
Ranking stability also looked strong. Those cases passed cleanly. That is worth saying because it means the result is not a blanket criticism of OKF. When the query maps cleanly to the file surface, markdown files with frontmatter are a perfectly reasonable representation.
The hard part
But file access is still not memory retrieval.
The failures showed up where memory systems usually fail: scope, supersession, type routing, and session drift. These are not markdown problems. They are state problems.
Scope disambiguation had 12 cases and only 4 passed. Mean precision was 0.21. This is the classic Redis problem. If Redis appears in a code context and a writing context, the model has to know which one belongs in the current request. A folder of files can expose both meanings, but it does not enforce which one is valid for this turn.
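A small sketch of what enforcing scope could look like at the runtime layer. Everything here is hypothetical: the Belief record, its scope field, and the idea that the runtime already knows the active scope for the turn are assumptions for illustration, not fields or behavior taken from the OKF spec or from our run.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Belief:
    belief_id: str
    title: str
    scope: str  # hypothetical field, e.g. "code" or "writing"


def select_in_scope(beliefs, term, active_scope):
    """Keep beliefs whose title mentions the term AND whose scope matches this turn."""
    term = term.lower()
    return [
        b for b in beliefs
        if term in b.title.lower() and b.scope == active_scope
    ]


beliefs = [
    Belief("b-redis-code", "Redis as the session cache", "code"),
    Belief("b-redis-writing", "How to capitalize Redis in docs", "writing"),
]

# Both files mention Redis; only one is valid for a coding turn.
print([b.belief_id for b in select_in_scope(beliefs, "redis", "code")])  # ['b-redis-code']
```

The filter itself is trivial. The hard part is knowing the active scope on every turn, and that lives in the runtime, not in the folder.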
Supersession was even sharper. The run had three supersession-chain exclusion cases, and none passed. Again, that is not because markdown cannot store an old belief and a new belief. It can. The issue is that the runtime has to know which belief is current and which belief must be excluded. A stale file sitting next to a current file is still accessible unless the retrieval layer knows the difference.
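As an illustration of chain exclusion, here is a minimal sketch that assumes each belief records the ID of the belief it replaces. The supersedes mapping is a hypothetical structure for this example. I am not claiming OKF frontmatter defines such a field or that the benchmark models chains this way.

```python
from typing import Dict, Optional, Set


def current_beliefs(supersedes: Dict[str, Optional[str]]) -> Set[str]:
    """Return belief IDs that no other belief supersedes.

    `supersedes` maps each belief ID to the ID it replaces, or None.
    """
    replaced = {old for old in supersedes.values() if old is not None}
    return set(supersedes) - replaced


chain = {
    "db-v1": None,      # original decision
    "db-v2": "db-v1",   # replaces v1
    "db-v3": "db-v2",   # replaces v2
    "ci-v1": None,      # unrelated belief, never replaced
}

print(sorted(current_beliefs(chain)))  # ['ci-v1', 'db-v3']
```

With a rule like this applied before the model sees the file list, the stale links in a chain would never be offered as candidates in the first place.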
Type routing and open questions had 0.20 mean precision. Fuzzy matching and prefix guards landed at 0.25. Budget eviction and capacity landed at 0.13. This is where the difference between a knowledge format and a memory system becomes very obvious. OKF can describe the knowledge. It does not decide how much of that knowledge should enter the model request.
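To show what deciding how much enters the request might involve, here is a simple greedy sketch under a token budget. The relevance scores, token costs, and function name are invented for illustration, and greedy selection is just one possible policy. It is not necessarily optimal and not what any particular tool does.

```python
def select_within_budget(candidates, budget_tokens):
    """Greedily pick beliefs by descending relevance while they fit the budget.

    candidates: list of (belief_id, relevance, token_cost) tuples.
    """
    chosen, used = [], 0
    for belief_id, _relevance, cost in sorted(
        candidates, key=lambda c: c[1], reverse=True
    ):
        if used + cost <= budget_tokens:
            chosen.append(belief_id)
            used += cost
    return chosen


candidates = [
    ("b-auth", 0.9, 400),
    ("b-deploy", 0.8, 700),
    ("b-style", 0.5, 300),
]

# b-deploy does not fit after b-auth, so the smaller b-style is taken instead.
print(select_within_budget(candidates, budget_tokens=800))  # ['b-auth', 'b-style']
```

Plain file access has no equivalent step: each read_file call adds a whole file to the request, whether or not the budget can absorb it.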
The bundle made knowledge inspectable. It did not make retrieval governed. That is the gap.
Session behavior
The session run is where the pattern got shaky.
The single-turn result is the optimistic view. The session run is closer to how people actually use AI tools. Topics move around. Earlier turns create noise. Some later turns are implicit. The model has to keep deciding whether to inspect files, which files matter, and whether the current turn still belongs to the same topic.
In that run, only 2 of 12 turns passed. Mean recall dropped to 0.45, and there was only 1 active retrieval pass. The p95 latency reached 59.3 seconds. The exact number matters less than the pattern: once retrieval becomes a repeated model behavior inside a session, the cost and instability compound.
This is the part I would watch most closely if I were adopting OKF inside a product. A single query against a clean file bundle can look pretty good. A real session needs continuity, exclusion, recency, scope, and budget control on every turn.
What it means
OKF solves interchange. It does not solve selection.
This is the main takeaway for me. OKF is a good answer to the interchange problem. It gives people a clean way to write portable knowledge bundles. That is valuable. It means a team can move curated context between tools without depending on one vendor's memory database.
But memory retrieval has a second problem: selection. Which beliefs are relevant? Which are current? Which are scoped to this user, team, repo, customer, or task? Which facts are pinned? Which facts are stale? Which open questions should be visible, and which should stay out of the request?
A file format can carry the fields needed to answer those questions. It cannot answer them by merely existing. The runtime has to enforce them.
Bottom line
The format is not the hard part.
Dropping markdown files into a tool is a great start. It is transparent, portable, and easy to reason about. But once the model has to choose which files to read, you are back in retrieval land.
The OKF run did better than naive memory retrieval because files, titles, descriptions, and frontmatter give the model useful handles. But the hard cases remained hard because the hard cases were never about markdown. They were about state.
So I am bullish on OKF as a format. I am not bullish on treating the format as the memory system. The next layer has to be the runtime: the thing that decides what gets retrieved, what gets excluded, what is current, what is scoped, and what the model was actually allowed to know.