2026-07-01 19:12 UTC | In-site rewrite | 5 min read | Updated: 2026-07-01 19:32 UTC

HN: Goat 2.0 – proactive episodic memory for AI agents

Goat 2.0 is a Telegram-based AI agent built around a proactive layered memory system. Unlike standard RAG, it retrieves memory before every turn, independent of query content. It features three independent backends (Redis, ChromaDB, Letta), adaptive token scaling, a priority-inverted L2/L3 budget split, and write-through archiving. The project shows one way to build an AI assistant whose memory retrieval is layered and always on.

Source: Hacker News AI | Author: takashikiari


Key points

Proactive retrieval: memory retrieval runs before the LLM responds on every turn, not triggered by the model noticing a gap.
Three independent backends: Working (Redis), Episodic (ChromaDB), and Permanent (Letta) each connect lazily and fail independently.
Adaptive Intent Token Scaling (AITS): dynamic token budget based on query confidence and complexity.
Complete fidelity: every turn is archived verbatim with no compression or extraction, enabling full semantic retrieval.



GOAT 2.0 is a Telegram-based AI agent built around a proactive layered memory system. The core distinction from a standard RAG setup: memory retrieval runs before the LLM responds, on every turn, independent of the query's content. A one-word ambiguous message still triggers a semantic search of past sessions and injects whatever is structurally relevant into the prompt — the model never has to ask "do I remember this?" because retrieval already happened.

The per-turn driver is Orchestrator.run (orchestrator/orchestrator.py). It talks to memory through one façade — MemoryLayers in memory/layers.py — and never imports a physical backend directly.

What makes it different

Proactive, not reactive. The prefetch daemon (Orchestrator._prefetch_daemon) starts as the first step of every turn (asyncio.create_task at orchestrator.py:247) and runs in parallel with the L0/L1/L2 fetch. Retrieval precedes generation; it is not triggered by the model noticing a gap.

Three independent physical backends. Redis (working), ChromaDB (episodic), and Letta (permanent) each connect lazily and fail independently — a Letta outage empties L1 facts and the turn continues (memory/layers.py:80).

No static thresholds. The L3 relevance filter is a ratio over the score distribution, scale-invariant by construction (MemoryLayers._gap_filter, layers.py:353). The context budget adapts per turn from two real signals (memory/aits.py). L3's minimum token slice is guaranteed by construction, not tuned (memory/context_budget.py).

Complete fidelity. Nothing is compressed or extracted. Every turn is archived verbatim (_archive_turn, orchestrator.py:173), so the full text of any past exchange is retrievable from a single semantic query — no summary stands between the model and the original words.

Architecture

Physical backends

| Backend | Class | Storage | Key / location |
| --- | --- | --- | --- |
| Working | WorkingMemory (memory/working/working.py) | Redis | goat2:working:{chat_id} (messages), cache:{chat_id}:{key} (L2.5) |
| Episodic | EpisodicMemory (memory/episodic/episodic.py) | ChromaDB PersistentClient | local collection at EPISODIC_STORAGE_PATH (./chroma_data), name episodic_memory |
| Permanent | PermanentMemory (memory/permanent/permanent.py) | Letta HTTP API | core-memory facts block on agent goat-permanent |

All three connect lazily on first use. The Redis client is built in WorkingMemory._get_client; ChromaDB in EpisodicMemory._get_collection; Letta via httpx.AsyncClient in PermanentMemory._get_http. ChromaDB's sync API is bridged with asyncio.to_thread. The episodic collection has no hnsw:space override, so search returns squared-L2 distances in result["score"] (lower = closer).
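The lazy-connect and fail-independently behaviour described above can be sketched as follows. This is a minimal illustration, not the project's code: `LazyBackend` and `read_or_empty` are hypothetical names, and the connect and read callables stand in for real client constructors. Only `asyncio.to_thread` is a real standard-library call, used here the way the text describes for bridging a sync API.

```python
import asyncio
from typing import Any, Callable, Optional


class LazyBackend:
    """Hypothetical wrapper: builds its client on first use, not at import."""

    def __init__(self, name: str, connect: Callable[[], Any]) -> None:
        self.name = name
        self._connect = connect              # blocking factory for the client
        self._client: Optional[Any] = None

    async def client(self) -> Any:
        if self._client is None:
            # Run the blocking constructor off the event-loop thread.
            self._client = await asyncio.to_thread(self._connect)
        return self._client


async def read_or_empty(backend: LazyBackend, read: Callable[[Any], list]) -> list:
    """Read from one tier; an outage yields an empty result, not an error."""
    try:
        client = await backend.client()
        return await asyncio.to_thread(read, client)
    except Exception:
        # One tier being down must not abort the turn.
        return []
```

Because each backend catches its own failure, a tier that cannot connect simply contributes nothing to the prompt while the other tiers proceed.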

Logical layers

| Layer | Content | In context | Backed by |
| --- | --- | --- | --- |
| L0 - Identity | Base persona from [identity] base_prompt | Always | config (memory.toml) |
| L1 - Facts | Curated key→value facts in Letta facts block | Always | Permanent |
| L2 - Working | Full conversation history for the current chat | Always (capped) | Working / Redis |
| L2.5 - Session Cache | TTL cache for L3 search results + tool outputs | When available | Working / Redis |
| L3 - Episodic | Semantic long-term memory; injected when relevant | Conditional | Episodic / ChromaDB |

The mapper (memory/layers.py)

MemoryLayers is the only memory interface the orchestrator and bot touch. It maps the five logical layers onto the three physical tiers and exposes typed methods: assemble_context, get_working_context, save_working_context, search_episodic, search_episodic_with_cache, store_episodic, find_by_keys, bump_access, promote_fact, and the full L2.5 cache API (get_cache, set_cache, invalidate_cache, clear_cache, cache_exists). Neither orchestrator.py nor telegram_interface/bot.py imports WorkingMemory, EpisodicMemory, or PermanentMemory.

Service registry (registry/registry.py)

ServiceRegistry is a lazy DI container that owns every service lifetime — LLM client, the three tiers, MemoryLayers, MemoryAnalytics, and PluginManager. It is a class instance passed by the caller, not a module-level singleton (the zero-singleton rule). Each backend is built on first property access.
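A lazy container of this kind can be sketched with `functools.cached_property`. The class below only illustrates the pattern: the attribute names, the config keys, and the dict placeholders standing in for real services are all assumptions, not the contents of registry/registry.py.

```python
from functools import cached_property


class ServiceRegistry:
    """Hypothetical lazy container: each service is built on first access."""

    def __init__(self, config: dict) -> None:
        self.config = config
        self.built: list[str] = []           # construction order, for illustration

    @cached_property
    def working(self) -> dict:
        self.built.append("working")
        return {"tier": "working", "url": self.config.get("redis_url")}

    @cached_property
    def episodic(self) -> dict:
        self.built.append("episodic")
        return {"tier": "episodic", "path": self.config.get("chroma_path")}

    @cached_property
    def layers(self) -> dict:
        # Depends on other services; touching them here builds them on demand.
        self.built.append("layers")
        return {"working": self.working, "episodic": self.episodic}
```

Since the registry is an ordinary instance, two registries never share state, which is the practical effect of avoiding a module-level singleton: tests can construct a fresh one per case.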

The prefetch daemon

This is the core differentiator. It runs on every turn, before the LLM call.

Started first, in parallel with L0/L1/L2

run() opens with asyncio.create_task(self._prefetch_daemon(chat_id, intent)) (orchestrator.py:247). Immediately after, it starts two more tasks — layers.get_identity_and_facts() (L0+L1) and layers.get_working_context() (L2) at orchestrator.py:250-251 — so the daemon's L3 search overlaps the tier fetches with real concurrency, not sequentially. The AITS classify (CPU, instant) runs during this overlap.
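The ordering described above (prefetch task created first, tier fetches started right after, everything overlapping) can be sketched with plain asyncio. The coroutine bodies are stand-ins that only sleep, and `run_turn` with its `prefetch_timeout` parameter is a hypothetical simplification of Orchestrator.run.

```python
import asyncio


async def prefetch_daemon(query: str) -> list[str]:
    await asyncio.sleep(0.05)                # stands in for the L3 semantic search
    return [f"episode matching {query!r}"]


async def get_identity_and_facts() -> str:
    await asyncio.sleep(0.05)                # stands in for the L0+L1 fetch
    return "persona + facts"


async def get_working_context() -> list[str]:
    await asyncio.sleep(0.05)                # stands in for the L2 fetch
    return ["previous message"]


async def run_turn(query: str, prefetch_timeout: float = 1.0) -> dict:
    # Create the prefetch task first so it overlaps everything that follows.
    prefetch = asyncio.create_task(prefetch_daemon(query))
    base = asyncio.create_task(get_identity_and_facts())
    working = asyncio.create_task(get_working_context())

    l0_l1, l2 = await asyncio.gather(base, working)
    try:
        l3 = await asyncio.wait_for(prefetch, timeout=prefetch_timeout)
    except asyncio.TimeoutError:
        l3 = []                              # a slow prefetch is dropped, not awaited
    return {"l0_l1": l0_l1, "l2": l2, "l3": l3}
```

All three tasks are scheduled before the first `await`, so the three waits run concurrently rather than back to back.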

Three mechanisms, evaluated independently, no gate

The daemon classifies the query three ways and runs every mechanism that scores above zero. There is no confidence gate on whether prefetch runs at all — the timeout is the only blocker. Classification lives in memory/query_classifier.py:

| Mechanism | Signal | Source | Retrieval |
| --- | --- | --- | --- |
| Temporal | A completed-past date range | extract_temporal_range (memory/temporal_parser.py, dateparser, STRICT_PARSING) | search_episodic(after, before): filtered semantic search |
| Thematic | Always 1.0 | none (unconditional) | search_episodic_with_cache: cached semantic search (carries cache_key) |
| Specific-key | Structural keys present | extract_structural_keys (query_classifier.py:38): UUID, agent-{uuid}, word+number, turn_/goat: | find_by_keys: UUID get-by-id + content $contains, exact match (score=0.0) |
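As a rough illustration of the specific-key mechanism, the sketch below pulls the four key shapes named above out of a query. The regular expressions are guesses at what "UUID, agent-{uuid}, word+number, turn_/goat:" might mean in practice, and `mechanisms_to_run` with its 1.0/0.0 scoring is hypothetical; the temporal mechanism is left out because it depends on the date parser.

```python
import re

UUID = r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"

# ASSUMPTION: illustrative patterns, tried from most to least specific.
PATTERNS = [
    re.compile(rf"\bagent-{UUID}\b", re.IGNORECASE),
    re.compile(rf"\b{UUID}\b", re.IGNORECASE),
    re.compile(r"\b(?:turn_|goat:)[\w:-]+", re.IGNORECASE),
    re.compile(r"\b[A-Za-z]+[-_]?\d+\b"),    # word+number, e.g. "ticket42"
]


def extract_structural_keys(query: str) -> list[str]:
    keys: list[str] = []
    remaining = query
    for pattern in PATTERNS:
        keys.extend(m.group(0) for m in pattern.finditer(remaining))
        # Blank out consumed spans so a later pattern cannot re-match them.
        remaining = pattern.sub(lambda m: " " * len(m.group(0)), remaining)
    return keys


def mechanisms_to_run(query: str) -> list[str]:
    # Thematic is unconditional; specific-key runs only when keys are present.
    scores = {
        "thematic": 1.0,
        "specific_key": 1.0 if extract_structural_keys(query) else 0.0,
    }
    return [name for name, score in scores.items() if score > 0]
```

Blanking matched spans matters here: without it, the word+number pattern would pick fragments such as `a456` out of the middle of a UUID.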

The temporal mechanism uses a grammatical date parser, not a keyword list. The "completed-past" rule (temporal_parser.py:59, before […]

    […] >= L3_GAP_SIGNIFICANCE:  # default 3.0
        keep everything before the largest gap
    else:
        inject nothing

With fewer than 3 results, an absolute ceiling of 1.5 applies instead (squared-L2 ≈ "nearly orthogonal" under unit-norm MiniLM embeddings). The ratio criterion is scale-invariant: as the archive grows, genuine query clusters produce ratios > 10 with no recalibration. Calibrated at 3.0 from 12 labeled queries (2026-06-29): unrelated ratios 2.33–2.76 rejected, genuine ratios 3.13–5.13 passed.
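The ratio test and the small-result fallback can be sketched as below. The text does not say how the ratio is computed, so this sketch assumes it is the largest gap between consecutive sorted scores divided by the mean of the remaining gaps; the function name, the `<=` comparison against the ceiling, and the constant `L3_ABSOLUTE_CEILING` are likewise assumptions. For unit-norm vectors, squared L2 distance equals 2 − 2·cos, so a distance of 1.5 corresponds to a cosine similarity of 0.25.

```python
from typing import Sequence

L3_GAP_SIGNIFICANCE = 3.0     # ratio threshold quoted in the text
L3_ABSOLUTE_CEILING = 1.5     # squared-L2 ceiling for fewer than 3 results


def gap_filter(scores: Sequence[float]) -> list[float]:
    """Return the scores to inject (squared-L2 distances, lower = closer)."""
    ordered = sorted(scores)
    if len(ordered) < 3:
        return [s for s in ordered if s <= L3_ABSOLUTE_CEILING]

    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    largest = max(gaps)
    cut = gaps.index(largest)                 # results 0..cut sit before that gap
    others = gaps[:cut] + gaps[cut + 1:]
    baseline = sum(others) / len(others)      # ASSUMPTION: mean of the other gaps
    if baseline == 0:
        ratio = float("inf") if largest > 0 else 0.0
    else:
        ratio = largest / baseline

    if ratio >= L3_GAP_SIGNIFICANCE:
        return ordered[: cut + 1]             # keep everything before the largest gap
    return []                                 # no clear cluster: inject nothing
```

Because the decision depends on a ratio of gaps rather than on absolute distances, multiplying every score by a constant leaves the outcome unchanged, which is the scale-invariance the text refers to.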

AITS — Adaptive Intent Token Scaling (memory/aits.py)

Every turn computes a dynamic token budget from two signals in the user message:

budget = BUDGET_BASE + confidence × BUDGET_CONFIDENCE_MULTIPLIER + complexity × BUDGET_COMPLEXITY_MAX_BONUS (capped at BUDGET_HARD_CAP)

Confidence (0–1): set-membership over the query's word tokens (split on whitespace, stripped of punctuation — no regex) against two lists — high (interrogative/analytical cues: what, how, why, cum, dece, când, …) and medium (auxiliary verbs). Empty query → 0.2; high cues → 0.8–1.0 scaled by cue count; medium cues → 0.5; any other statement → 0.5. There is no low-confidence list — greetings and short turns default to 0.5, not 0.2.

Complexity (0–1): (len(query) / 200) × 0.7 + connector_bonus × 0.3, capped at 1.0. connector_bonus fires on multi-part connectors (and, or, și, sau, plus, ;, ,).

Default knobs (config/memory.toml [aits]): base 2000, multiplier 4000, complexity bonus 2000, hard cap 12000. A greeting yields ~2000 tokens; a detailed multi-part question approaches 12000.
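The budget formula and the two signals can be written out as a short sketch. The cue set below is the partial list quoted in the text, the step of 0.1 per extra high cue and the binary connector bonus are assumptions, and none of the function names are taken from memory/aits.py. Note that, with the defaults as listed, this formula gives about 4000 tokens for a greeting at confidence 0.5 and tops out at 8000, so the real module presumably weights the terms differently from this sketch to reach the ~2000 and ~12000 figures quoted above.

```python
import string

BUDGET_BASE = 2000
BUDGET_CONFIDENCE_MULTIPLIER = 4000
BUDGET_COMPLEXITY_MAX_BONUS = 2000
BUDGET_HARD_CAP = 12000

HIGH_CUES = {"what", "how", "why", "cum", "dece", "când"}   # partial list from the text
WORD_CONNECTORS = {"and", "or", "și", "sau", "plus"}
CHAR_CONNECTORS = (";", ",")


def tokens(query: str) -> list[str]:
    # Whitespace split plus punctuation strip, no regex, as the text describes.
    stripped = (w.strip(string.punctuation).lower() for w in query.split())
    return [w for w in stripped if w]


def confidence(query: str) -> float:
    words = tokens(query)
    if not words:
        return 0.2
    high = sum(1 for w in words if w in HIGH_CUES)
    if high:
        # ASSUMPTION: 0.8 for one cue, +0.1 per extra cue, capped at 1.0.
        return min(1.0, 0.8 + 0.1 * (high - 1))
    return 0.5      # medium cues and plain statements both land on 0.5


def complexity(query: str) -> float:
    has_connector = any(w in WORD_CONNECTORS for w in tokens(query)) or any(
        c in query for c in CHAR_CONNECTORS
    )
    connector_bonus = 1.0 if has_connector else 0.0   # ASSUMPTION: binary bonus
    return min(1.0, (len(query) / 200) * 0.7 + connector_bonus * 0.3)


def budget(query: str) -> int:
    raw = (
        BUDGET_BASE
        + confidence(query) * BUDGET_CONFIDENCE_MULTIPLIER
        + complexity(query) * BUDGET_COMPLEXITY_MAX_BONUS
    )
    return min(BUDGET_HARD_CAP, round(raw))
```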

Priority-inverted L2/L3 split (memory/context_budget.py)

After L0+L1 tokens are subtracted, allocate_context_budget splits the remainder with L3 first:

    l3_guarantee = L3_MIN_GUARANTEE_TOKENS   # 1200 by default
    l2_cap = available - l3_guarantee        # L2 takes the rest, AITS-scaled

L2 has a floor (L2_FLOOR_TOKENS = 500) that wins only on pathologically small budgets. On every realistic AITS budget, L3 is guaranteed at least 1200 tokens regardless of how long L2 has grown — L2 can no longer eat the whole budget and starve L3 to zero. L3 is fit into the remainder after L0+L1+L2 in assemble_context (layers.py:279); l3_used is how many results fit.
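A minimal sketch of the split, assuming the two constants quoted above and a simplified signature (the real allocate_context_budget may take different arguments and return more than two values):

```python
L3_MIN_GUARANTEE_TOKENS = 1200
L2_FLOOR_TOKENS = 500


def allocate_context_budget(total_budget: int, l0_l1_tokens: int) -> tuple[int, int]:
    """Return (l2_cap, l3_guarantee) for one turn. Hypothetical signature."""
    available = max(0, total_budget - l0_l1_tokens)

    # L3 is served first; L2 receives whatever remains.
    l3_guarantee = min(L3_MIN_GUARANTEE_TOKENS, available)
    l2_cap = available - l3_guarantee

    if l2_cap < L2_FLOOR_TOKENS:
        # Pathologically small budget: the L2 floor wins, L3 gets the leftover.
        l2_cap = min(L2_FLOOR_TOKENS, available)
        l3_guarantee = available - l2_cap
    return l2_cap, l3_guarantee
```

With a 4000-token budget and 600 tokens of L0+L1, this yields an L2 cap of 2200 and the full 1200-token L3 guarantee; only when fewer than 1700 tokens remain after L0+L1 does the L2 floor start cutting into L3.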

Memory mechanics

Write-through archive (_archive_turn, orchestrator.py:173)

After every turn's L2 save, the orchestrator fires _archive_turn as asyncio.create_task — fire-and-forget, never blocks the response, never raises:

    content = f"user: {intent}\nassistant: {reply}"
    await layers.store_episodic(chat_id, content, tags=["l2_full_archive"])

Every turn lands in L3 verbatim, tagged l2_full_archive to distinguish automatic writes from GOAT's curated store_memory / promote_memory calls. store_episodic seeds access_count=0 and last_accessed_ts on write (layers.py:130-135) so the merge-score terms exist from the first retrieval.
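The fire-and-forget, never-raises behaviour can be sketched as follows. The list-backed `store_episodic` is a stand-in for the real layers method, and `fire_and_forget_archive` and the `_background` set are hypothetical helpers; holding a reference to each task is a common precaution, since the event loop keeps only weak references to tasks.

```python
import asyncio
import logging

logger = logging.getLogger("archive-sketch")
_background: set[asyncio.Task] = set()       # strong refs keep pending tasks alive


async def store_episodic(store: list, chat_id: int, content: str, tags: list[str]) -> None:
    # Stand-in for layers.store_episodic; here it only appends to a list.
    store.append({"chat_id": chat_id, "content": content, "tags": tags})


async def archive_turn(store: list, chat_id: int, intent: str, reply: str) -> None:
    try:
        content = f"user: {intent}\nassistant: {reply}"
        await store_episodic(store, chat_id, content, tags=["l2_full_archive"])
    except Exception:
        # Never raise: an archive failure must not surface in the reply path.
        logger.exception("archive failed for chat %s", chat_id)


def fire_and_forget_archive(store: list, chat_id: int, intent: str, reply: str) -> asyncio.Task:
    task = asyncio.create_task(archive_turn(store, chat_id, intent, reply))
    _background.add(task)
    task.add_done_callback(_background.discard)
    return task
```

The caller returns the reply without awaiting the task; the write completes on a later loop iteration.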

L2 conversation trim (MemoryLayers._trim_recent_messages, layers.py:294)

When L2 history exceeds its cap, messages are dropped oldest-first with one exception: the very first message (the topic-setter) is pinned provided it is small ([…]

[…] list[ToolDefinition] into tools/goat_skills/ to add a tool without a restart.

Observability

One MemoryObservation (memory/observability.py) is emitted as a JSON log line per turn, built by ObservationCollector (memory/observability_collector.py) and aggregated by the registry-owned MemoryAnalytics (memory/analytics.py). Fields include: AITS confidence/complexity/budget, L2.5 cache hit/miss + key, latency per stage (classify / search / assemble / inject / llm / save / total), tokens per tier (L0+L1 / L2 / L3), prefetch outcome (attempted / succeeded / timeout / blocks injected / blocks used), and results found/used. A summary report is logged every ANALYTICS_LOG_INTERVAL requests (default 100).

The intent_category field is a coarse, analytics-only label derived from the confidence tier (observability_collector.py:111): recall (≥ 0.4), greeting (< 0.3), else conversational. It labels the analytics tier and never gates prefetch.
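The category thresholds and the one-JSON-line-per-turn idea can be sketched together. The `Observation` dataclass below carries only four illustrative fields and is not the real MemoryObservation schema; `observe` is a hypothetical helper.

```python
import json
from dataclasses import asdict, dataclass


def intent_category(confidence: float) -> str:
    # Thresholds as stated in the text: analytics label only, never a gate.
    if confidence >= 0.4:
        return "recall"
    if confidence < 0.3:
        return "greeting"
    return "conversational"


@dataclass
class Observation:
    # ASSUMPTION: illustrative field names, a small subset of what is logged.
    confidence: float
    budget: int
    cache_hit: bool
    intent_category: str


def observe(confidence: float, budget: int, cache_hit: bool) -> str:
    """Build one JSON log line for a turn."""
    obs = Observation(confidence, budget, cache_hit, intent_category(confidence))
    return json.dumps(asdict(obs), sort_keys=True)
```

Emitting a flat JSON object per turn keeps the log greppable and lets an aggregator parse each line independently.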

Project layout

goat2/
├── memory/
│   ├── layers.py                   # Backend mapper: the only memory interface (L0-L3 + L2.5)
│   ├── aits.py                     # Adaptive Intent Token Scaling (confidence + complexity)
│   ├── context_budget.py           # Priority-inverted L2/L3 budget split
│   ├── result_merger.py            # Prefetch merge: dedupe + blended score (0.6/0.3/0.1)
│   ├── query_classifier.py         # 3-mechanism prefetch classification (temporal/thematic/specific-key)
│   ├── temporal_parser.py          # dateparser completed-past range extraction
│   ├── session_cache.py            # L2.5 TTL cache (Redis)
│   ├── promote.py                  # L3 → L1 promotion, cap-guarded
│   ├── budget.py                   # Token estimation + result-count enforcement
│   ├── observability.py            # MemoryObservation dataclass (per-turn JSON)
│   ├── observability_collector.py  # Per-turn observation builder + intent category
│   ├── analytics.py                # Registry-owned metrics aggregator + report
│   ├── config.py                   # Reads config/memory.toml; all numeric knobs
│   ├── working/working.py          # Redis-backed working memory (L2)
│   ├── episodic/
│   │   ├── episodic.py             # ChromaDB lifecycle + store/search (L3)
│   │   ├── queries.py              # find_by_keys, bump_access, get_recent/count/delete (L3 mixin)
│   │   └── warmup.py               # Collection pre-warm at startup
│   └── permanent/permanent.py      # Letta-backed permanent memory (L0/L1)
├── orchestrator/
│   ├── orchestrator.py             # Per-turn driver: prefetch → AITS → assemble →

[truncated for AI cost control]