LocalVibe – Pure-Rust local AI stack with MCP, in one binary (Apple Silicon)

LocalVibe is a pure-Rust local AI coding assistant optimized for Apple Silicon. It provides chat with quantized LLMs via Metal, on-device ONNX embeddings, LanceDB vector search, and a TUI. It also includes an MCP server for Claude Code, an OpenAI-compatible HTTP server, and various tool integrations.

Key points

  • Pure-Rust binary with Candle+Metal inference, fastembed-rs embeddings, and LanceDB vector store.
  • TUI with five sections: Chat, Models, Databases, Index, Settings.
  • Ships an MCP server for Claude Code and an OpenAI-compatible HTTP server.
  • Configurable with GGUF models and multiple embedding backends; can be used alongside llama.cpp.

Why it matters

A single pure-Rust binary bundles Candle+Metal inference, fastembed-rs embeddings, and a LanceDB vector store, so chat, indexing, and search all run on-device with the default backends.

Technical impact

May affect model selection, inference cost, product capability, and evaluation benchmarks.

Pure-Rust local coding assistant: chat with a quantized LLM on Metal, index any directory with on-device ONNX embeddings, search it with LanceDB, all from one ratatui TUI.

Runs on Apple Silicon (M1–M4). Candle + Metal for inference, fastembed-rs for embeddings, LanceDB for vectors.

A TUI screenshot will live here once one is captured — see docs/screenshots/.

Quick start

Assumes ~/.cargo/bin is on PATH, you are on macOS, and you have a GGUF model supported by Candle (qwen2 / llama family — Qwen 3.5 hybrid SSM is not supported).

1. install the localvibe binary (lv alias is also installed):

    git clone https://github.com/Sok205/local_vibe ~/code/local_vibe
    cd ~/code/local_vibe
    cargo install --path crates/lv-cli

2. download a chat model (~4.6 GB):

    DEST=~/.lmstudio/models/lmstudio-community/Qwen2.5-7B-Instruct-GGUF
    mkdir -p "$DEST"
    curl -L -o "$DEST/Qwen2.5-7B-Instruct-Q4_K_M.gguf" \
      https://huggingface.co/bartowski/Qwen2.5-7B-Instruct-GGUF/resolve/main/Qwen2.5-7B-Instruct-Q4_K_M.gguf
    curl -L -o "$DEST/tokenizer.json" \
      https://huggingface.co/Qwen/Qwen2.5-7B-Instruct/resolve/main/tokenizer.json

3. write config (macOS path, as returned by dirs::config_dir()):

    mkdir -p ~/Library/Application\ Support/local-vibe
    cp local-vibe.example.toml ~/Library/Application\ Support/local-vibe/config.toml

   Then edit the paths inside to point at your real GGUF + tokenizer.

4. run:

    lv                                          # TUI
    lv ask "explain lifetimes in 2 sentences"

Inside the TUI, F1..F5 (or Ctrl+1..5 where your terminal supports it) jumps between Chat · Models · Databases · Index · Settings. Everything else is discoverable by sight. No slash commands to memorise.

First TUI launch takes ~5 s to memory-map the 4.4 GB GGUF and ~10 s extra on first fastembed run (downloads the ONNX embedding weights into ./.fastembed_cache/).

How it works

    lv-cli (binary)
      main.rs → AppContext (impl AppHost) → dispatcher
        ├─ lv-tui        ratatui UI + overlay framework
        ├─ lv-inference  fastembed / mlx-lm EmbeddingBackend
        └─ lv-rag        LanceDB store + indexer + chunker + tree-sitter

    lv-metal   Candle+Metal InferenceBackend
    lv-core    traits, config, types, status, AppHost
    lv-mcp     ◄── Arc, stdio MCP server for Claude Code

Three swappable trait pairs in lv-core:

InferenceBackend — streams chat completions. Implementations: MetalBackend (Candle GGUF, on-device), MlxLmBackend (Python HTTP fallback).

EmbeddingBackend — produces 384 / 768-d float vectors. Implementations: FastEmbedBackend (ONNX, pure Rust — default), MlxLmBackend (HTTP fallback).

AppHost — narrow capability surface that AppContext implements; MCP and the TUI reach application state through it, which keeps lv-mcp free of a circular dep on lv-cli.

AppContext (in crates/lv-cli/src/app_context.rs) keeps a per-tier HashMap of loaded chat backends plus an active_tier, so you can load / unload / switch chat models at runtime. Named vector stores are cached the same way.
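One way to picture the per-tier map is a HashMap from tier to a boxed trait object plus an optional active marker. The following is a minimal sketch only: `Tier`, `ChatBackend`, `EchoBackend`, and `Registry` are hypothetical stand-ins invented for illustration, not the actual lv-core or lv-cli definitions, and the real InferenceBackend streams tokens rather than returning a String.

```rust
use std::collections::HashMap;

/// Hypothetical tier names, taken from the fast / medium / strong slots in the README.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub enum Tier {
    Fast,
    Medium,
    Strong,
}

/// Stand-in for a chat backend trait (assumption: the real trait streams output).
pub trait ChatBackend {
    fn complete(&self, prompt: &str) -> String;
}

/// Toy backend that echoes the prompt, so the sketch runs without a model.
pub struct EchoBackend(pub String);

impl ChatBackend for EchoBackend {
    fn complete(&self, prompt: &str) -> String {
        format!("[{}] {}", self.0, prompt)
    }
}

/// Loaded backends keyed by tier, plus the tier currently used for chat.
#[derive(Default)]
pub struct Registry {
    loaded: HashMap<Tier, Box<dyn ChatBackend>>,
    active: Option<Tier>,
}

impl Registry {
    /// Keep a backend in memory for the given tier.
    pub fn load(&mut self, tier: Tier, backend: Box<dyn ChatBackend>) {
        self.loaded.insert(tier, backend);
    }

    /// Switch the active tier; only tiers that are already loaded are accepted.
    pub fn set_active(&mut self, tier: Tier) -> Result<(), String> {
        if self.loaded.contains_key(&tier) {
            self.active = Some(tier);
            Ok(())
        } else {
            Err(format!("{:?} is not loaded", tier))
        }
    }

    /// Route a prompt to the active backend, if there is one.
    pub fn ask(&self, prompt: &str) -> Option<String> {
        let tier = self.active?;
        self.loaded.get(&tier).map(|b| b.complete(prompt))
    }
}
```

Because callers only see `Box<dyn ChatBackend>`, swapping an on-device backend for an HTTP fallback would not change the registry code, which is the property the trait split is meant to provide.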

Configuration

lv reads, in order:

./local-vibe.toml (current directory)

~/Library/Application Support/local-vibe/config.toml (macOS) — or ~/.config/local-vibe/config.toml (Linux)
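The lookup order above can be sketched as a list of candidate paths that is scanned front to back. This is an illustration under assumptions: it treats "in order" as "first existing file wins" (the README does not say whether files are merged), and the function names `config_candidates` and `first_existing` are hypothetical.

```rust
use std::path::{Path, PathBuf};

/// Candidate config files in the order the README lists them.
/// `cwd` and `home` are passed in (not read from the environment) to keep this testable.
pub fn config_candidates(cwd: &Path, home: &Path, macos: bool) -> Vec<PathBuf> {
    let user_config = if macos {
        home.join("Library/Application Support/local-vibe/config.toml")
    } else {
        home.join(".config/local-vibe/config.toml")
    };
    vec![cwd.join("local-vibe.toml"), user_config]
}

/// Return the first candidate for which `exists` reports true.
/// Assumption: the first file found is used and later ones are ignored.
pub fn first_existing<F>(candidates: &[PathBuf], exists: F) -> Option<PathBuf>
where
    F: Fn(&Path) -> bool,
{
    candidates.iter().find(|p| exists(p.as_path())).cloned()
}
```

In a real program the predicate would typically be `|p| p.is_file()`, with `cwd` taken from `std::env::current_dir()`.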

Minimal working config:

    [models.medium]                          # chat model
    name           = "qwen2.5-7b-instruct"
    backend        = "metal"
    model_path     = "/Users/YOU/…/Qwen2.5-7B-Instruct-Q4_K_M.gguf"
    tokenizer_path = "/Users/YOU/…/tokenizer.json"

    [models.embedding]                       # omit this section to disable RAG
    name = "bge-small-en"                    # or "nomic-embed-text" (768-d)
    # backend defaults to "fastembed", no Python needed

    [rag]
    db_root = "/Users/YOU/.local/share/local-vibe/dbs"   # enables multi-DB mode

Accepted embedding model names: bge-small-en (384-d, ~130 MB), bge-base-en (768-d), nomic-embed-text-v1.5 (768-d, ~260 MB).
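Since the accepted models produce vectors of different widths, a database built with one width cannot be searched with query vectors of another. A small lookup makes that explicit. This is a sketch: `embedding_dim` and `compatible` are hypothetical helpers, and whether lv performs such a check itself is not stated in the README.

```rust
/// Vector width for each embedding model name listed as accepted above.
pub fn embedding_dim(name: &str) -> Option<usize> {
    match name {
        "bge-small-en" => Some(384),
        "bge-base-en" | "nomic-embed-text-v1.5" => Some(768),
        _ => None, // unknown name: no width known
    }
}

/// True when a store holding `store_dim`-wide vectors can be queried with `name`.
pub fn compatible(name: &str, store_dim: usize) -> bool {
    embedding_dim(name) == Some(store_dim)
}
```

Under this sketch, switching from bge-small-en to bge-base-en implies re-indexing, because the stored 384-d vectors no longer match the 768-d queries.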

Declare [models.fast] and [models.strong] the same way if you want to switch between tiers from inside the TUI (F2 → Enter on the tier you want).

Omit db_root to stay in single-DB mode at [rag].db_dir (default: ~/Library/Application Support/local-vibe/db).

A full annotated example lives in local-vibe.example.toml at the repo root.

CLI reference

    lv                 # launch TUI (default)
    lv ask ""          # one-shot chat; streams to stdout
    lv index           # index a directory into the current DB
    lv status          # full snapshot: models + every DB + runtime state
    lv status --json   # same, as JSON (for piping into Claude Code etc.)
    lv stats           # chunk / file counts in the current DB (legacy)
    lv dbs             # list DB names (single line each; --json available)
    lv ls              # list files in a DB (--limit N, --json available)
    lv models          # print the configured backend for each tier
    lv serve           # MCP server on stdio (for Claude Code etc.)
    lv http            # OpenAI-compatible HTTP server (chat completions + tool use)
    lv --help

CLI commands log to stderr. The TUI logs to ~/.local/share/local-vibe/lv.log so log lines don't overlap the UI (tail it with tail -f ~/.local/share/local-vibe/lv.log).

TUI reference

The layout borrows from LM Studio: a persistent left sidebar with five first-class sections, an always-on status strip, and a context-sensitive hint line at the bottom. There's no command palette — everything is one Ctrl+N jump away.

    ┌ local-vibe ── chat: qwen2.5-7b (medium · warm) · db: rust-rag · 2 warm · idle ─┐
    │ F1 Chat       │ ┌─ Chat ───────────────────────┬─ Context ──────────────┐      │
    │>F2 Models     │ │ You: …                       │ rust-book.md #3        │      │
    │ F3 Databases  │ │ AI: …                        │ "Spawning Tasks"       │      │
    │ F4 Index      │ │                              │                        │      │
    │ F5 Settings   │ │ > _                          │                        │      │
    │               │ └──────────────────────────────┴────────────────────────┘      │
    │ ?: help       │ Enter send · Tab → Context · ↑↓ scroll · F1..F5 sect.          │
    └───────────────┴────────────────────────────────────────────────────────────────┘

Global keys

| Key | Effect |
| --- | --- |
| F1 … F5 (or Ctrl+1 … Ctrl+5 where the terminal supports it) | jump to Chat · Models · Databases · Index · Settings |
| Tab | cycle focus between sub-panes of the current section |
| Esc | back out of a focused sub-pane or peek overlay |
| ? (when not typing) | toggle the help overlay |
| Ctrl-C / Ctrl-Q | quit |

F1 · Chat

Two-column layout, always. Left (~70%) is the conversation + input; right (~30%) is the Context pane showing retrieved chunks for the last answer. Tab toggles focus input ↔ context. Enter sends. ↑/↓ scroll the history (input focus) or move a cursor over chunks (context focus). Typing /anything (except /quit) is passed to the model as prose — no special slash handling.

F2 · Models

One row per slot: fast · medium · strong · cloud · embed. Columns show name, backend, warm/cold state, and an active marker.

| Key on a selected row | Effect |
| --- | --- |
| Enter on cold | load the tier and make it active for chat |
| Enter on warm | make it active without re-loading |
| l | load (but don't change active tier) |
| u | unload (refused on the currently active tier) |
| a | set active; requires the tier to already be warm |
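The key rules in this table amount to a small state machine over "which tiers are warm" and "which tier is active". A minimal sketch follows, assuming hypothetical names (`Key`, `Slots`, `press`) that are not taken from the lv-tui source; actual model loading is reduced to inserting a name into a set.

```rust
use std::collections::HashSet;

/// The four actions from the table above.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Key {
    Enter,
    Load,     // `l`
    Unload,   // `u`
    Activate, // `a`
}

/// Warm/cold and active state for the model slots.
#[derive(Default)]
pub struct Slots {
    warm: HashSet<String>,
    active: Option<String>,
}

impl Slots {
    /// Apply a key press to the selected tier, following the table's rules.
    pub fn press(&mut self, key: Key, tier: &str) -> Result<(), String> {
        match key {
            // Enter: load if cold (no-op if already warm), then make active.
            Key::Enter => {
                self.warm.insert(tier.to_string());
                self.active = Some(tier.to_string());
                Ok(())
            }
            // l: load without touching the active tier.
            Key::Load => {
                self.warm.insert(tier.to_string());
                Ok(())
            }
            // u: refused on the currently active tier.
            Key::Unload => {
                if self.active.as_deref() == Some(tier) {
                    Err(format!("{} is active; refusing to unload", tier))
                } else {
                    self.warm.remove(tier);
                    Ok(())
                }
            }
            // a: only a warm tier can become active.
            Key::Activate => {
                if self.warm.contains(tier) {
                    self.active = Some(tier.to_string());
                    Ok(())
                } else {
                    Err(format!("{} is cold; load it first", tier))
                }
            }
        }
    }

    pub fn is_warm(&self, tier: &str) -> bool {
        self.warm.contains(tier)
    }

    pub fn active(&self) -> Option<&str> {
        self.active.as_deref()
    }
}
```

The one asymmetry worth noting is that Enter never fails, while `u` and `a` can both be refused depending on the current state.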

F3 · Databases

Two columns. Left: every DB with an active marker. Right: detail for the selected DB — path, indexed-at timestamp, file and chunk counts, top-5 language histogram, last error if any.

| Key | Effect |
| --- | --- |
| ↑ / ↓ | select a DB |
| Enter | activate (and jump back to Chat) |
| b | file browser peek (language pills 1…9, 0 clears) |

F4 · Index

Two text fields stacked: Path and Into. Entering the section prefills Into with the active DB. Tab inside Path runs filesystem completion; falling through, it cycles focus. Enter submits. While indexing, a magenta progress bar shows done/total and the current file. ↑/↓ cycles between fields.

F5 · Settings

Read-only: version, config path, DB root, process id, warm models and DBs, session id. Right panel has a compact global + per-section keybind reference. Not editable in this version — config changes are still a TOML edit + restart.

Status strip

Dot-separated segments at the top of every screen:

◆ local-vibe · medium:qwen2.5-7b · db:rust-rag · 52 files · 2 warm

The active model turns yellow during load and green once warm. N warm counts every tier held in memory including the embedder. A magenta indexing done/total: file segment appears while an index run is in flight.

Use as an MCP server

lv serve speaks MCP over stdio, so any MCP client (Claude Code, Cursor, custom agents) can call into the local index. Five tools are exposed; the DB-specific ones accept an optional db argument that defaults to the server's current DB.

| Tool | What it does |
| --- | --- |
| search_code | semantic search; filters by language / file_path / db |
| index_directory | parse + chunk + embed a directory into the store (or db) |
| get_stats | total chunks and unique files, optionally per db |
| list_sources | summary of indexed files, optionally per db |
| get_status | full snapshot JSON: models, every DB, runtime state |

Wire it into Claude Code:

claude mcp add lv lv serve

The server uses the current DB (whichever F3 → Enter would pick in the TUI) when no db argument is given. Logs go to ~/.local/share/local-vibe/lv-mcp.log so they don't corrupt the JSON-RPC frames on stdout.
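For a custom agent, calling a tool over stdio means writing one JSON-RPC 2.0 message per line to the server's stdin, using the MCP `tools/call` method (after the usual MCP `initialize` handshake, which is omitted here). The sketch below builds such a line for search_code. The argument key `query` is an assumption, since the README names only the filters; `search_code_request` and `json_escape` are hypothetical helpers.

```rust
/// Escape a string for use inside a JSON string literal.
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Build a newline-terminated `tools/call` request for the search_code tool.
/// `query` is an assumed argument name; `db` is the optional argument from the README.
pub fn search_code_request(id: u64, query: &str, db: Option<&str>) -> String {
    let mut args = format!("\"query\":\"{}\"", json_escape(query));
    if let Some(db) = db {
        args.push_str(&format!(",\"db\":\"{}\"", json_escape(db)));
    }
    format!(
        "{{\"jsonrpc\":\"2.0\",\"id\":{},\"method\":\"tools/call\",\"params\":{{\"name\":\"search_code\",\"arguments\":{{{}}}}}}}\n",
        id, args
    )
}
```

Leaving `db` as `None` omits the key entirely, which matches the documented behavior of falling back to the server's current DB.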

Use as an HTTP server (OpenAI-compatible)

lv http exposes the in-process Candle backend behind an OpenAI Chat Completions API on localhost. Any OpenAI-compatible client (Zed AI, claude-code-router, generic SDKs) can drive it.

    lv http                              # 127.0.0.1:8080, lazy model load
    lv http --tier medium                # pre-load the medium tier on startup
    lv http --host 0.0.0.0 --port 9000   # bind elsewhere

Endpoints:

| Method + path | Behavior |
| --- | --- |
| GET /health | {"status":"ok"} |
| GET /v1/models | lists fast / medium / strong aliases plus the configured names |
| POST /v1/chat/completions | OpenAI Chat Completions; streaming (SSE) and non-streaming both supported |

The model field accepts "fast" / "medium" / "strong" (mapped to the matching [models.] slot) or any of your configured model names. Unknown values fall back to medium.
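The resolution rule for the model field can be sketched as a two-step lookup with a default. This is an illustration only: `Tier` and `resolve_model` are hypothetical names, and the order in which aliases and configured names are checked is an assumption.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Tier {
    Fast,
    Medium,
    Strong,
}

/// Map the request's `model` string to a tier.
/// `configured` pairs each declared tier with the `name` from its [models.*] section.
pub fn resolve_model(requested: &str, configured: &[(Tier, &str)]) -> Tier {
    match requested {
        // Tier aliases are checked first (assumed order).
        "fast" => Tier::Fast,
        "medium" => Tier::Medium,
        "strong" => Tier::Strong,
        // Otherwise try the configured model names, then fall back to medium.
        other => configured
            .iter()
            .find(|(_, name)| *name == other)
            .map(|(tier, _)| *tier)
            .unwrap_or(Tier::Medium),
    }
}
```

The fallback means a client that sends a hard-coded name such as an OpenAI model id still gets an answer from the medium tier instead of an error.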

Tool use

Tool calling is layered at the HTTP boundary. When a request includes a tools array, lv http:

Renders the tool catalog as a Hermes-format JSON block and merges it into the system message.

Forces non-streaming for that turn so the full response can be parsed.

Extracts every {...} from the model output and returns them as OpenAI-shaped tool_calls with finish_reason: "tool_calls".

This keeps InferenceBackend text-in / text-out and means tool support works on any model that can follow the format prompt (Qwen 2.5 / 3 / 3-Coder, etc.). Models without explicit tool training will be less reliable; treat tool support as best-effort on small generalist models.
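The extraction step can be sketched as a scan for delimited blocks in the raw model text. The sketch assumes the common Hermes convention of wrapping each call in `<tool_call>` and `</tool_call>` tags; the README text above shows only `{...}`, so the exact delimiters lv uses are not confirmed. `extract_tool_calls` and `finish_reason` are hypothetical helpers, and the JSON payloads are returned as raw strings without validation.

```rust
/// Collect the payload of every complete <tool_call>…</tool_call> block, in order.
pub fn extract_tool_calls(output: &str) -> Vec<&str> {
    const OPEN: &str = "<tool_call>";
    const CLOSE: &str = "</tool_call>";
    let mut calls = Vec::new();
    let mut rest = output;
    while let Some(start) = rest.find(OPEN) {
        let after_open = &rest[start + OPEN.len()..];
        match after_open.find(CLOSE) {
            Some(end) => {
                // Keep the text between the tags, without surrounding whitespace.
                calls.push(after_open[..end].trim());
                rest = &after_open[end + CLOSE.len()..];
            }
            // Unterminated block (e.g. truncated generation): stop scanning.
            None => break,
        }
    }
    calls
}

/// OpenAI-style finish reason for the turn: "tool_calls" if any call was found.
pub fn finish_reason(calls: &[&str]) -> &'static str {
    if calls.is_empty() {
        "stop"
    } else {
        "tool_calls"
    }
}
```

Parsing needs the complete output, because a closing tag may arrive at the very end of generation; that is consistent with forcing non-streaming for tool turns.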

Hybrid stack with llama.cpp (for qwen35 etc.)

Candle currently has no backend for the Qwen 3.5 / 3.6 hybrid-SSM architecture (general.architecture = "qwen35"). Until Candle adds support, the recommended way to run those models is to keep lv for RAG, MCP, and the architectures it does serve, and run llama-server from llama.cpp alongside it for the rest:

    brew install llama.cpp   # or build from source

    # Start llama-server on a different port; --jinja enables the model's
    # native tool-call template.
    llama-server \
      -m ~/Models/.../Qwen3.6-27B-Q6_K.gguf \
      --host 127.0.0.1 --port 8081 \
      --jinja -c 32768 -ngl 99 \
      --alias qwen3.6-27b

Suggested topology:

    Claude Code / Zed AI ─┬─→ lv http       :8080  (qwen2 / qwen3 via Candle)
                          └─→ llama-server  :8081  (qwen35 / hybrid SSM)

    lv serve (stdio) ← Claude Code MCP (RAG over your indexed corpus)
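On the client side, this topology reduces to choosing a base URL per model architecture. A minimal sketch, assuming the ports shown above and a hypothetical helper `base_url_for`; sending every non-qwen35 architecture to lv is a simplification, since lv itself only serves the families Candle supports.

```rust
/// Pick the OpenAI-compatible base URL for a GGUF `general.architecture` value.
/// Ports follow the suggested topology; adjust them if you bind elsewhere.
pub fn base_url_for(architecture: &str) -> &'static str {
    match architecture {
        // Hybrid-SSM models go to llama-server.
        "qwen35" => "http://127.0.0.1:8081/v1",
        // Everything else is sent to lv http (simplifying assumption).
        _ => "http://127.0.0.1:8080/v1",
    }
}
```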
