HN: Goat 2.0 – proactive episodic memory for AI agents
GOAT 2.0 is a Telegram-based AI agent built around a proactive layered memory system. Unlike standard RAG, it retrieves memory before every turn, independent of query content. It features three independent backends (Redis, ChromaDB, Letta), adaptive token scaling, a priority-inverted L2/L3 budget split, and write-through archiving. The project demonstrates one way to build an AI assistant with layered memory mechanisms.
GOAT 2.0 is a Telegram-based AI agent built around a proactive layered memory system. The core distinction from a standard RAG setup: memory retrieval runs before the LLM responds, on every turn, independent of the query's content. A one-word ambiguous message still triggers a semantic search of past sessions and injects whatever is structurally relevant into the prompt — the model never has to ask "do I remember this?" because retrieval already happened.
The per-turn driver is Orchestrator.run (orchestrator/orchestrator.py). It talks to memory through one façade — MemoryLayers in memory/layers.py — and never imports a physical backend directly.
What makes it different
Proactive, not reactive. The prefetch daemon (Orchestrator._prefetch_daemon) starts as the first step of every turn (asyncio.create_task at orchestrator.py:247) and runs in parallel with the L0/L1/L2 fetch. Retrieval precedes generation; it is not triggered by the model noticing a gap.
Three independent physical backends. Redis (working), ChromaDB (episodic), and Letta (permanent) each connect lazily and fail independently — a Letta outage empties L1 facts and the turn continues (memory/layers.py:80).
No static thresholds. The L3 relevance filter is a ratio over the score distribution, scale-invariant by construction (MemoryLayers._gap_filter, layers.py:353). The context budget adapts per turn from two real signals (memory/aits.py). L3's minimum token slice is guaranteed by construction, not tuned (memory/context_budget.py).
Complete fidelity. Nothing is compressed or extracted. Every turn is archived verbatim (_archive_turn, orchestrator.py:173), so the full text of any past exchange is retrievable from a single semantic query — no summary stands between the model and the original words.
Architecture
Physical backends
| Backend | Class | Storage | Key / location |
| --- | --- | --- | --- |
| Working | WorkingMemory (memory/working/working.py) | Redis | goat2:working:{chat_id} (messages), cache:{chat_id}:{key} (L2.5) |
| Episodic | EpisodicMemory (memory/episodic/episodic.py) | ChromaDB PersistentClient | local collection at EPISODIC_STORAGE_PATH (./chroma_data), name episodic_memory |
| Permanent | PermanentMemory (memory/permanent/permanent.py) | Letta HTTP API | core-memory facts block on agent goat-permanent |
All three connect lazily on first use. The Redis client is built in WorkingMemory._get_client; ChromaDB in EpisodicMemory._get_collection; Letta via httpx.AsyncClient in PermanentMemory._get_http. ChromaDB's sync API is bridged with asyncio.to_thread. The episodic collection has no hnsw:space override, so search returns squared-L2 distances in result["score"] (lower = closer).
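The lazy-connect and thread-bridging pattern described above can be sketched as follows. This is a minimal illustration, not the project's code: `LazyAsyncBackend`, its `factory` argument, and `FakeCollection` are hypothetical stand-ins, and no real ChromaDB call is made.

```python
import asyncio
import threading
from typing import Any, Callable, Optional


class LazyAsyncBackend:
    """Hypothetical wrapper: build a blocking client on first use, call it off the event loop."""

    def __init__(self, factory: Callable[[], Any]) -> None:
        self._factory = factory              # e.g. a function that opens a collection
        self._client: Optional[Any] = None
        self._lock = threading.Lock()        # guards the one-time construction

    def _connect(self) -> Any:
        # Runs in a worker thread; only the first caller actually builds the client.
        with self._lock:
            if self._client is None:
                self._client = self._factory()
            return self._client

    async def call(self, method: str, *args: Any, **kwargs: Any) -> Any:
        client = await asyncio.to_thread(self._connect)
        # The synchronous method runs in a worker thread so the event loop stays free.
        return await asyncio.to_thread(getattr(client, method), *args, **kwargs)


class FakeCollection:
    """Stand-in for a synchronous collection API (assumption, for illustration only)."""

    def query(self, text: str) -> list:
        return [{"id": "turn-1", "score": 0.42, "text": text}]
```

Nothing is constructed until the first `await backend.call("query", ...)`, which mirrors the "connect lazily on first use" behaviour the text describes.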
Logical layers
| Layer | Content | In context | Backed by |
| --- | --- | --- | --- |
| L0: Identity | Base persona from [identity] base_prompt | Always | config (memory.toml) |
| L1: Facts | Curated key→value facts in Letta facts block | Always | Permanent |
| L2: Working | Full conversation history for the current chat | Always (capped) | Working / Redis |
| L2.5: Session Cache | TTL cache for L3 search results + tool outputs | When available | Working / Redis |
| L3: Episodic | Semantic long-term memory; injected when relevant | Conditional | Episodic / ChromaDB |
The mapper (memory/layers.py)
MemoryLayers is the only memory interface the orchestrator and bot touch. It maps the five logical layers onto the three physical tiers and exposes typed methods: assemble_context, get_working_context, save_working_context, search_episodic, search_episodic_with_cache, store_episodic, find_by_keys, bump_access, promote_fact, and the full L2.5 cache API (get_cache, set_cache, invalidate_cache, clear_cache, cache_exists). Neither orchestrator.py nor telegram_interface/bot.py imports WorkingMemory, EpisodicMemory, or PermanentMemory.
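To illustrate how a method like search_episodic_with_cache might sit on top of the L2.5 cache API, here is a cache-aside sketch. It is an assumption-laden illustration: `TTLCacheSketch` is an in-memory stand-in for Redis, and the key derivation in `cache_key` (a truncated SHA-256 of the normalized query) is hypothetical; only the `cache:{chat_id}:{key}` shape comes from the backend table.

```python
import hashlib
import time


class TTLCacheSketch:
    """In-memory stand-in for the Redis-backed L2.5 cache (hypothetical)."""

    def __init__(self, ttl_s, clock=time.monotonic):
        self._ttl = ttl_s
        self._clock = clock
        self._data = {}

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._data[key]          # expired: behave like a miss
            return None
        return value

    def set(self, key, value):
        self._data[key] = (self._clock() + self._ttl, value)


def cache_key(chat_id, query):
    # Assumption: the key is a short digest of the normalized query text.
    digest = hashlib.sha256(query.strip().lower().encode("utf-8")).hexdigest()[:16]
    return f"cache:{chat_id}:{digest}"


def search_with_cache(cache, search, chat_id, query):
    """Return (results, cache_hit). `search` is any callable taking the query."""
    key = cache_key(chat_id, query)
    hit = cache.get(key)
    if hit is not None:
        return hit, True
    results = search(query)
    cache.set(key, results)
    return results, False
```

Returning the hit flag alongside the results is one way to feed the cache hit/miss field that the observability section mentions.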
Service registry (registry/registry.py)
ServiceRegistry is a lazy DI container that owns every service lifetime — LLM client, the three tiers, MemoryLayers, MemoryAnalytics, and PluginManager. It is a class instance passed by the caller, not a module-level singleton (the zero-singleton rule). Each backend is built on first property access.
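A lazy container of this kind can be sketched with `functools.cached_property`. The class below is hypothetical (`SketchRegistry`, its dict-valued services, and the `built` log are illustrative assumptions), but it shows the two properties the text names: services are built on first property access, and the registry is an ordinary instance rather than a module-level singleton.

```python
from functools import cached_property


class SketchRegistry:
    """Hypothetical lazy DI container: each service is built on first access."""

    def __init__(self, config):
        self._config = config
        self.built = []                  # construction log, for illustration only

    @cached_property
    def working(self):
        self.built.append("working")
        return {"backend": "redis", "url": self._config.get("redis_url")}

    @cached_property
    def layers(self):
        # Accessing self.working here triggers its construction on demand.
        self.built.append("layers")
        return {"working": self.working}


# Two registries are fully independent: no shared module-level state.
reg_a = SketchRegistry({"redis_url": "redis://localhost:6379/0"})
reg_b = SketchRegistry({"redis_url": "redis://localhost:6379/1"})
```

Because the caller owns the instance, a test can hand the orchestrator a registry whose properties return fakes without patching any globals.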
The prefetch daemon
This is the core differentiator. It runs on every turn, before the LLM call.
- Started first, in parallel with L0/L1/L2
run() opens with asyncio.create_task(self._prefetch_daemon(chat_id, intent)) (orchestrator.py:247). Immediately after, it starts two more tasks — layers.get_identity_and_facts() (L0+L1) and layers.get_working_context() (L2) at orchestrator.py:250-251 — so the daemon's L3 search overlaps the tier fetches with real concurrency, not sequentially. The AITS classify (CPU, instant) runs during this overlap.
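The ordering described above (prefetch task first, tier fetches started right after, timeout as the only blocker) can be sketched like this. The coroutine bodies, the `run_turn` name, and the `PREFETCH_TIMEOUT_S` value are assumptions for illustration; the real orchestrator is not shown in this section.

```python
import asyncio

PREFETCH_TIMEOUT_S = 0.5   # hypothetical value; the real knob is not shown here


async def prefetch_daemon(chat_id, intent):
    await asyncio.sleep(0.01)            # stands in for the L3 search
    return [f"episodic hit for {intent!r}"]


async def get_identity_and_facts():
    await asyncio.sleep(0.01)            # stands in for the L0 + L1 fetch
    return "persona + facts"


async def get_working_context(chat_id):
    await asyncio.sleep(0.01)            # stands in for the L2 fetch
    return ["earlier message"]


async def run_turn(chat_id, intent, timeout=PREFETCH_TIMEOUT_S):
    # Start the prefetch first so it overlaps everything that follows.
    prefetch = asyncio.create_task(prefetch_daemon(chat_id, intent))
    l0_l1 = asyncio.create_task(get_identity_and_facts())
    l2 = asyncio.create_task(get_working_context(chat_id))

    identity, history = await asyncio.gather(l0_l1, l2)
    try:
        l3 = await asyncio.wait_for(prefetch, timeout=timeout)
    except asyncio.TimeoutError:
        l3 = []                          # a slow search degrades to "no L3", not an error
    return {"l0_l1": identity, "l2": history, "l3": l3}
```

Since all three tasks are created before the first `await`, the three waits overlap instead of adding up.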
- Three mechanisms, evaluated independently, no gate
The daemon classifies the query three ways and runs every mechanism that scores above zero. There is no confidence gate on whether prefetch runs at all — the timeout is the only blocker. Classification lives in memory/query_classifier.py:
| Mechanism | Signal | Source | Retrieval |
| --- | --- | --- | --- |
| Temporal | A completed-past date range | extract_temporal_range (memory/temporal_parser.py, dateparser, STRICT_PARSING) | search_episodic(after, before): filtered semantic search |
| Thematic | Always 1.0 | none (unconditional) | search_episodic_with_cache: cached semantic search (carries cache_key) |
| Specific-key | Structural keys present | extract_structural_keys (query_classifier.py:38): UUID, agent-{uuid}, word+number, turn_/goat: | find_by_keys: UUID get-by-id + content $contains, exact match (score=0.0) |
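As a rough illustration of the specific-key mechanism, the sketch below pulls the four key shapes named in the table out of a query. The regular expressions are assumptions made for this example; the patterns actually used by extract_structural_keys are not shown in this document.

```python
import re

_UUID = r"[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}"

# Ordered from most to least specific; all patterns are hypothetical.
_PATTERNS = [
    re.compile(rf"\bagent-{_UUID}\b"),          # agent-{uuid}
    re.compile(rf"\b{_UUID}\b"),                # bare UUID
    re.compile(r"\b(?:turn_|goat:)[\w:-]+"),    # turn_ / goat: prefixes
    re.compile(r"\b[A-Za-z]+[-_]?\d+\b"),       # word+number, e.g. ticket42
]


def extract_structural_keys_sketch(query):
    """Return structural keys found in the query, most specific patterns first."""
    keys = []
    remaining = query
    for pattern in _PATTERNS:
        keys.extend(pattern.findall(remaining))
        # Blank out what was matched so a UUID fragment such as "a456"
        # is not picked up again by the looser word+number pattern.
        remaining = pattern.sub(" ", remaining)
    return keys


example = extract_structural_keys_sketch(
    "what did agent-123e4567-e89b-12d3-a456-426614174000 say about ticket42 in turn_17?"
)
```

An empty result means the mechanism scores zero and is skipped, while the thematic search still runs unconditionally.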
The temporal mechanism uses a grammatical date parser, not a keyword list. The "completed-past" rule (temporal_parser.py:59, before […]

    if […] >= L3_GAP_SIGNIFICANCE:   # default 3.0
        keep everything before the largest gap
    else:
        inject nothing

With fewer than 3 results, an absolute ceiling of 1.5 applies instead (squared-L2 ≈ "nearly orthogonal" under unit-norm MiniLM embeddings). The ratio criterion is scale-invariant: as the archive grows, genuine query clusters produce ratios […] 10 with no recalibration. Calibrated at 3.0 from 12 labeled queries (2026-06-29): unrelated ratios 2.33–2.76 rejected, genuine ratios 3.13–5.13 passed.
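A gap filter of this shape can be sketched as below. The exact ratio used by MemoryLayers._gap_filter is not recoverable from this text, so the sketch assumes one plausible definition: the largest gap between consecutive sorted scores divided by the mean of the remaining gaps. For context on the fallback ceiling: with unit-norm embeddings, squared L2 distance equals 2 − 2·cos θ, so a distance of 1.5 corresponds to a cosine similarity of 0.25.

```python
L3_GAP_SIGNIFICANCE = 3.0    # default from the text
ABSOLUTE_CEILING = 1.5       # fallback when fewer than 3 results


def gap_filter_sketch(scores):
    """Keep the scores before the largest gap if that gap is significant.

    Scores are squared-L2 distances (lower = closer). The ratio definition
    here is an assumption, not the project's confirmed formula.
    """
    ordered = sorted(scores)
    if len(ordered) < 3:
        return [s for s in ordered if s <= ABSOLUTE_CEILING]

    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    cut = max(range(len(gaps)), key=gaps.__getitem__)
    largest = gaps[cut]
    others = gaps[:cut] + gaps[cut + 1:]
    baseline = sum(others) / len(others)

    if baseline == 0:
        ratio = float("inf") if largest > 0 else 0.0
    else:
        ratio = largest / baseline

    if ratio >= L3_GAP_SIGNIFICANCE:
        return ordered[:cut + 1]     # everything before the largest gap
    return []                        # no clear cluster: inject nothing


clustered = gap_filter_sketch([0.40, 0.45, 0.50, 1.20, 1.25])
uniform = gap_filter_sketch([0.5, 0.7, 0.9, 1.1])
```

Multiplying every score by a constant leaves the ratio unchanged, which is the sense in which a ratio test is scale-invariant.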
AITS — Adaptive Intent Token Scaling (memory/aits.py)
Every turn computes a dynamic token budget from two signals in the user message:
budget = BUDGET_BASE + confidence × BUDGET_CONFIDENCE_MULTIPLIER + complexity × BUDGET_COMPLEXITY_MAX_BONUS (capped at BUDGET_HARD_CAP)
Confidence (0–1): set-membership over the query's word tokens (split on whitespace, stripped of punctuation — no regex) against two lists — high (interrogative/analytical cues: what, how, why, cum, dece, când, …) and medium (auxiliary verbs). Empty query → 0.2; high cues → 0.8–1.0 scaled by cue count; medium cues → 0.5; any other statement → 0.5. There is no low-confidence list — greetings and short turns default to 0.5, not 0.2.
Complexity (0–1): (len(query) / 200) × 0.7 + connector_bonus × 0.3, capped at 1.0. connector_bonus fires on multi-part connectors (and, or, și, sau, plus, ;, ,).
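Default knobs (config/memory.toml [aits]): base 2000, multiplier 4000, complexity bonus 2000, hard cap 12000. With these defaults the formula above gives a bare greeting (confidence 0.5, near-zero complexity) roughly 4000 tokens, and a detailed multi-part question at most 2000 + 4000 + 2000 = 8000, so the 12000 hard cap only binds if the knobs are raised.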
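The budget computation can be traced with a short sketch that follows the formulas as written. Several details are assumptions: the cue sets are only the excerpt listed in the text, the high-cue scaling (0.8 for one cue, +0.1 per extra cue) is a guess at "0.8–1.0 scaled by cue count", and the connector bonus is treated as 1.0 when any connector is present.

```python
import string

HIGH_CUES = {"what", "how", "why", "cum", "dece", "când"}   # excerpt from the text
WORD_CONNECTORS = {"and", "or", "și", "sau", "plus"}
CHAR_CONNECTORS = (";", ",")

BUDGET_BASE = 2000
BUDGET_CONFIDENCE_MULTIPLIER = 4000
BUDGET_COMPLEXITY_MAX_BONUS = 2000
BUDGET_HARD_CAP = 12000


def _tokens(query):
    # Split on whitespace and strip punctuation; no regex, as in the text.
    stripped = (w.strip(string.punctuation).lower() for w in query.split())
    return [w for w in stripped if w]


def confidence(query):
    words = _tokens(query)
    if not words:
        return 0.2
    high_hits = sum(1 for w in words if w in HIGH_CUES)
    if high_hits:
        # Assumption: 0.8 for one cue, +0.1 per additional cue, capped at 1.0.
        return min(1.0, 0.8 + 0.1 * (high_hits - 1))
    return 0.5   # medium cues and plain statements both land on 0.5


def complexity(query):
    has_connector = any(w in WORD_CONNECTORS for w in _tokens(query)) or any(
        c in query for c in CHAR_CONNECTORS
    )
    bonus = 1.0 if has_connector else 0.0   # assumption: binary bonus
    return min(1.0, (len(query) / 200) * 0.7 + bonus * 0.3)


def budget(query):
    raw = (
        BUDGET_BASE
        + confidence(query) * BUDGET_CONFIDENCE_MULTIPLIER
        + complexity(query) * BUDGET_COMPLEXITY_MAX_BONUS
    )
    return min(BUDGET_HARD_CAP, round(raw))


print(budget(""))     # 2800
print(budget("hi"))   # 4014
```

These outputs follow from the formula exactly as stated; with both signals at 1.0 the same defaults give 2000 + 4000 + 2000 = 8000.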
Priority-inverted L2/L3 split (memory/context_budget.py)
After L0+L1 tokens are subtracted, allocate_context_budget splits the remainder with L3 first:
    l3_guarantee = L3_MIN_GUARANTEE_TOKENS   # 1200 by default
    l2_cap = available - l3_guarantee        # L2 takes the rest, AITS-scaled
L2 has a floor (L2_FLOOR_TOKENS = 500) that wins only on pathologically small budgets. On every realistic AITS budget, L3 is guaranteed at least 1200 tokens regardless of how long L2 has grown — L2 can no longer eat the whole budget and starve L3 to zero. L3 is fit into the remainder after L0+L1+L2 in assemble_context (layers.py:279); l3_used is how many results fit.
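The split can be sketched as a small pure function. The constants come from the text; the function name, its return shape, and the exact way the floor is applied are assumptions, since allocate_context_budget itself is not shown.

```python
L3_MIN_GUARANTEE_TOKENS = 1200
L2_FLOOR_TOKENS = 500


def allocate_context_budget_sketch(total_budget, l0_l1_tokens):
    """Reserve L3's guarantee first, then hand the rest to L2 (hypothetical shape)."""
    available = max(0, total_budget - l0_l1_tokens)
    l2_cap = available - L3_MIN_GUARANTEE_TOKENS
    if l2_cap < L2_FLOOR_TOKENS:
        # Pathologically small budget: the L2 floor wins over the L3 guarantee.
        l2_cap = min(available, L2_FLOOR_TOKENS)
    l3_min = available - l2_cap
    return {"available": available, "l2_cap": l2_cap, "l3_min": l3_min}


normal = allocate_context_budget_sketch(total_budget=4000, l0_l1_tokens=600)
tiny = allocate_context_budget_sketch(total_budget=2000, l0_l1_tokens=600)
```

In the first call L3 keeps its full 1200 tokens and L2 is capped at 2200; in the second, only 1400 tokens are available, so L2 takes its 500-token floor and L3 gets the remaining 900.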
Memory mechanics
Write-through archive (_archive_turn, orchestrator.py:173)
After every turn's L2 save, the orchestrator fires _archive_turn as asyncio.create_task — fire-and-forget, never blocks the response, never raises:
    content = f"user: {intent}\nassistant: {reply}"
    await layers.store_episodic(chat_id, content, tags=["l2_full_archive"])
Every turn lands in L3 verbatim, tagged l2_full_archive to distinguish automatic writes from GOAT's curated store_memory / promote_memory calls. store_episodic seeds access_count=0 and last_accessed_ts on write (layers.py:130-135) so the merge-score terms exist from the first retrieval.
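A fire-and-forget write that never raises can be sketched as follows. The `store` callable stands in for layers.store_episodic, and `fire_and_forget` plus the `_background` set are hypothetical helpers; holding a reference to the task is a common precaution, because the event loop keeps only weak references to tasks.

```python
import asyncio
import logging

log = logging.getLogger("archive_sketch")
_background = set()          # strong references so pending tasks are not garbage-collected


async def archive_turn(store, chat_id, intent, reply):
    """Archive one turn verbatim; swallow and log any storage failure."""
    content = f"user: {intent}\nassistant: {reply}"
    try:
        await store(chat_id, content, tags=["l2_full_archive"])
    except Exception:
        # The archive is best-effort: a backend outage must not break the turn.
        log.exception("archive failed for chat %s", chat_id)


def fire_and_forget(coro):
    """Schedule a coroutine without awaiting it (must be called inside a running loop)."""
    task = asyncio.create_task(coro)
    _background.add(task)
    task.add_done_callback(_background.discard)
    return task
```

The caller would invoke `fire_and_forget(archive_turn(...))` after the L2 save and return the reply immediately, without awaiting the task.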
L2 conversation trim (MemoryLayers._trim_recent_messages, layers.py:294)
When L2 history exceeds its cap, messages are dropped oldest-first with one exception: the very first message (the topic-setter) is pinned provided it is small ([…]

[…] list[ToolDefinition] into tools/goat_skills/ to add a tool without a restart.
Observability
One MemoryObservation (memory/observability.py) is emitted as a JSON log line per turn, built by ObservationCollector (memory/observability_collector.py) and aggregated by the registry-owned MemoryAnalytics (memory/analytics.py). Fields include: AITS confidence/complexity/budget, L2.5 cache hit/miss + key, latency per stage (classify / search / assemble / inject / llm / save / total), tokens per tier (L0+L1 / L2 / L3), prefetch outcome (attempted / succeeded / timeout / blocks injected / blocks used), and results found/used. A summary report is logged every ANALYTICS_LOG_INTERVAL requests (default 100).
The intent_category field is a coarse, analytics-only label derived from the confidence tier (observability_collector.py:111): recall (≥ 0.4), greeting (< 0.3), else conversational. It labels the analytics tier and never gates prefetch.
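A per-turn observation emitted as one JSON log line might look like the sketch below. The field names in `ObservationSketch` are a hypothetical subset chosen for illustration; the thresholds in `intent_category` are the ones stated above.

```python
import json
from dataclasses import asdict, dataclass


def intent_category(confidence):
    """Analytics-only label derived from the AITS confidence tier."""
    if confidence >= 0.4:
        return "recall"
    if confidence < 0.3:
        return "greeting"
    return "conversational"


@dataclass
class ObservationSketch:
    # Hypothetical subset of the fields listed in the text.
    chat_id: str
    aits_confidence: float
    aits_budget: int
    cache_hit: bool
    l3_results_used: int
    intent_category: str = ""

    def __post_init__(self):
        if not self.intent_category:
            self.intent_category = intent_category(self.aits_confidence)


def to_log_line(obs):
    # One compact JSON object per turn, with stable key order for easy diffing.
    return json.dumps(asdict(obs), sort_keys=True)


line = to_log_line(
    ObservationSketch(chat_id="c1", aits_confidence=0.5, aits_budget=4014,
                      cache_hit=False, l3_results_used=2)
)
```

Keeping the label a pure function of the confidence value makes it clear that it only annotates the log line and cannot influence whether prefetch runs.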
Project layout
    goat2/
    ├── memory/
    │   ├── layers.py                  # Backend mapper: the only memory interface (L0-L3 + L2.5)
    │   ├── aits.py                    # Adaptive Intent Token Scaling (confidence + complexity)
    │   ├── context_budget.py          # Priority-inverted L2/L3 budget split
    │   ├── result_merger.py           # Prefetch merge: dedupe + blended score (0.6/0.3/0.1)
    │   ├── query_classifier.py        # 3-mechanism prefetch classification (temporal/thematic/specific-key)
    │   ├── temporal_parser.py         # dateparser completed-past range extraction
    │   ├── session_cache.py           # L2.5 TTL cache (Redis)
    │   ├── promote.py                 # L3 → L1 promotion, cap-guarded
    │   ├── budget.py                  # Token estimation + result-count enforcement
    │   ├── observability.py           # MemoryObservation dataclass (per-turn JSON)
    │   ├── observability_collector.py # Per-turn observation builder + intent category
    │   ├── analytics.py               # Registry-owned metrics aggregator + report
    │   ├── config.py                  # Reads config/memory.toml; all numeric knobs
    │   ├── working/working.py         # Redis-backed working memory (L2)
    │   ├── episodic/
    │   │   ├── episodic.py            # ChromaDB lifecycle + store/search (L3)
    │   │   ├── queries.py             # find_by_keys, bump_access, get_recent/count/delete (L3 mixin)
    │   │   └── warmup.py              # Collection pre-warm at startup
    │   └── permanent/permanent.py     # Letta-backed permanent memory (L0/L1)
    ├── orchestrator/
    │   ├── orchestrator.py            # Per-turn driver: prefetch → AITS → assemble →
[truncated for AI cost control]