2026-06-03 20:32 UTCIn-site rewrite4 min readUpdated: 2026-06-30 13:03 UTC

Show HN: Mnemo – local-first AI memory layer for any LLM (Rust, SQLite, petgraph)

Mnemo is a sidecar service that provides persistent memory for LLMs by extracting entities and relationships into a knowledge graph. It works with local or cloud LLMs, runs in under 50ms, and has no cloud dependency. Supports Docker, binary, and Python SDK.

SourceHacker News AIAuthor: zaydmulani

Notifications You must be signed in to change notification settings

Fork 0

Star 0

BranchesTags

Open more actions menu

Folders and files

NameName

Last commit message

Last commit date

Latest commit

History

21 Commits

crates

docs

examples

sdk/python

.dockerignore

.env.example

.gitignore

CONTEXT.md

CONTRIBUTING.md

Cargo.lock

Cargo.toml

Dockerfile

LICENSE

Makefile

README.md

docker-compose.override.yml

docker-compose.yml

mnemo.example.toml

Repository files navigation

Local-first AI memory layer for any LLM. Persistent knowledge graph, entity extraction, semantic retrieval — no cloud required.

What is mnemo?

Most LLMs forget everything the moment a conversation ends. mnemo fixes that.

mnemo is a sidecar service that watches every conversation you feed it, extracts named entities and relationships using an LLM, builds a persistent knowledge graph in SQLite, and injects relevant context back into future prompts — automatically, in under 50ms. It works with Ollama (fully local, free), OpenAI, Anthropic, or any OpenAI-compatible API. It ships as a single static binary with zero cloud dependency.

How it works

your app │ ▼ POST /ingest ──► entity extraction (LLM) ──► knowledge graph (SQLite + petgraph) │ POST /retrieve ◄── scoring + ranking ◄── graph traversal + full-text search │ ▼ context_prompt ──► inject into your LLM prompt

You POST raw text to /ingest (a conversation turn, a document, a note).

mnemo sends it to your configured LLM and extracts entities (people, tools, places, concepts) and the relationships between them.

Entities are deduplicated by name+type, aliases are merged, and everything is written to SQLite. The in-memory petgraph is updated atomically.

On POST /retrieve, mnemo runs a 6-stage pipeline: full-text chunk search → entity name search → graph expansion (BFS over the knowledge graph) → relation filter → score+rank → assemble a context_prompt string.

You inject context_prompt into your LLM's system prompt. Done.

Quickstart

Path A — Docker + Ollama (fully free, recommended)

git clone https://github.com/zaydmulani09/mnemo cd mnemo docker compose up -d

Pull the llama3 model the first time (~4 GB)

docker exec mnemo-ollama ollama pull llama3

Verify everything is healthy

curl http://localhost:8080/health

Path B — Binary (Ollama or OpenAI running separately)

cargo install --path crates/mnemo-api

With Ollama

export MNEMO_LLM_BASE_URL=http://localhost:11434/v1 mnemo-api

With OpenAI

export MNEMO_LLM_BASE_URL=https://api.openai.com/v1 export MNEMO_LLM_API_KEY=sk-... export MNEMO_LLM_MODEL=gpt-4o-mini export MNEMO_LLM_PROVIDER=openai mnemo-api

Path C — Python SDK

pip install mnemo-sdk

from mnemo import MnemoClient

client = MnemoClient() # server at http://localhost:8080

Store a memory

client.ingest("I'm building a Rust vector database called vecdb")

Get context for injection into your next LLM prompt

print(client.get_context("what am I working on?"))

API Reference

All endpoints accept and return application/json. Base URL: http://localhost:8080.

Method Path Description Request body Response

GET /health Server + DB + LLM status — HealthResponse

POST /ingest Store text, extract entities IngestRequest IngestResponse

POST /retrieve Retrieve ranked memory context RetrievalQuery RetrievalResult

GET /entities List entities (paginated) ?limit&offset Entity[]

GET /entities/:id Get entity by UUID — Entity

DELETE /entities/:id Delete entity (cascades) — {"deleted":true}

GET /entities/:id/neighbors Knowledge graph neighbors ?depth (max 5) GraphNode[]

GET /chunks List memory chunks (paginated) ?limit&offset&session_id MemoryChunk[]

GET /chunks/:id Get chunk by UUID — MemoryChunk

DELETE /chunks/:id Delete chunk — {"deleted":true}

POST /search Full-text search entities + chunks {"query","limit"} {"entities","chunks"}

DELETE /wipe Delete all memory (irreversible) header: X-Confirm-Wipe: true {"wiped":true}

GET /stats Entity/chunk/graph counts + uptime — StatsResponse

Key request/response types:

Full endpoint documentation with curl examples: docs/api.md

Configuration

Environment variables

Variable Default Description

MNEMO_DB_PATH mnemo.db SQLite database file path

MNEMO_PORT 8080 API server port

MNEMO_LLM_BASE_URL http://localhost:11434/v1 OpenAI-compatible LLM base URL

MNEMO_LLM_MODEL llama3 Model name for entity extraction

MNEMO_LLM_API_KEY ollama API key (any value works for Ollama)

MNEMO_LLM_PROVIDER ollama Provider type: ollama, openai, anthropic, custom

TOML config file

Pass --config path/to/config.toml to mnemo-api. See mnemo.example.toml:

db_path = "mnemo.db" port = 8080

[llm] provider = "ollama" base_url = "http://localhost:11434/v1" model = "llama3" api_key = "ollama" timeout_secs = 30 max_retries = 3 max_tokens = 2048 temperature = 0.1

Environment variables take precedence over TOML values. The active config source is reported in GET /health → config_source.

CLI

Install:

cargo install --path crates/mnemo-cli

Usage:

Store a memory

mnemo ingest "I use Neovim and prefer dark mode"

Retrieve relevant context

mnemo search "what editor do I use?"

List all extracted entities

mnemo entities

Show entity detail + graph neighbors

mnemo entity --neighbors

List memory chunks

mnemo chunks

Server health

mnemo health

Memory statistics

mnemo stats

Delete everything (prompts for confirmation)

mnemo wipe

Skip confirmation prompt

mnemo wipe --yes

Point at a non-default server

mnemo --server http://192.168.1.10:8080 stats

Python SDK

Install:

pip install mnemo-sdk

See sdk/python/README.md for the full API reference.

Async example:

import asyncio from mnemo import AsyncMnemoClient

async def main(): async with AsyncMnemoClient() as client: await client.ingest( "Alice is a principal engineer at Stripe working on payment infrastructure.", session_id="session-001", ) context = await client.get_context( "what does Alice work on?", session_id="session-001", ) print(context)

asyncio.run(main())

A working standalone example: examples/basic_usage.py

Architecture

Four Rust crates wired together:

Crate Type Role

mnemo-core lib Entity extraction, graph ops, retrieval engine, DB layer

mnemo-api bin Axum REST API — thin handler layer over mnemo-core

mnemo-cli bin CLI tool using blocking reqwest against the API

mnemo-bench bin Performance benchmarks (12 suites)

Full architecture documentation: docs/architecture.md

Performance

Benchmarked on Apple M2, SQLite WAL mode, in-memory petgraph. Debug build numbers — release build (--release) is 3–5× faster.

Operation Avg latency Throughput

Entity insert (SQLite) ~0.12 ms ~8,300 ops/s

Entity lookup by ID ~0.08 ms ~12,500 ops/s

Chunk insert ~0.14 ms ~7,100 ops/s

Full-text chunk search ~0.28 ms ~3,500 ops/s

Graph neighbor (depth=1) ~0.21 ms ~4,700 ops/s

Graph neighbor (depth=2) ~0.89 ms ~1,100 ops/s

Full retrieval pipeline ~4.2 ms ~238 ops/s

Run cargo run -p mnemo-bench to benchmark on your hardware.

Testing

Rust

cargo test --workspace # run all 122 tests make coverage # HTML coverage report (requires cargo-llvm-cov) make coverage-summary # summary to stdout

Python SDK

cd sdk/python && pytest tests/ -v

Benchmarks

cargo run -p mnemo-bench # all 12 benchmarks cargo run -p mnemo-bench -- --filter graph # graph benchmarks only cargo run -p mnemo-bench -- --json out.json # save results to JSON

Current test counts: 122 Rust tests · 21 Python tests · 12 benchmarks

Contributing

PRs welcome. Please run make fmt && make lint before submitting. Open an issue first for large changes.

See CONTRIBUTING.md for full setup instructions, code style guide, and how to add a new LLM provider.

License

MIT — see LICENSE

About

Local-first AI memory layer for any LLM. Persistent knowledge graph, entity extraction, semantic retrieval. Works with Ollama, OpenAI, Anthropic, or any OpenAI-compatible backend.

Resources

Readme

License

MIT license

Contributing

Uh oh!

There was an error while loading. Please reload this page.

Activity

Stars

0 stars

Watchers

0 watching

Forks

0 forks

Report repository

Releases

1 tags

Packages 0

Uh oh!

There was an error while loading. Please reload this page.

Contributors

Uh oh!

There was an error while loading. Please reload this page.

Languages

Rust 79.4%

Python 15.8%

Jupyter Notebook 3.5%

Other 1.3%