Self-hosting the modern AI stack could be the way forward
llmaker is an open-source platform that lets you run the complete modern LLM stack on your own infrastructure — large language models, vector databases, embeddings, caching, observability, and a built-in retrieval & agent layer — provisioned, networked, and production-shaped from a single command.
Self-host the modern LLM stack.
llmaker is an open-source platform for running the complete modern LLM stack on your own infrastructure — large language models, vector databases, embeddings, caching, observability, and a built-in retrieval & agent layer — provisioned, networked, and production-shaped from a single command.
Build private retrieval-augmented chatbots, FAQ assistants, and recommendation engines locally. No third-party API keys. No data leaving your machine.
Quickstart · Why llmaker · Stacks · The agent · Architecture · CLI · Roadmap
Overview
Running a model locally is easy. Shipping an application is not. A production retrieval system needs a vector database, an embeddings service, a caching layer, an orchestration layer, and observability — each containerized, networked, and configured to discover the others. Assembling that is a recurring tax: a sprawl of docker run flags, a brittle Compose file, and hundreds of lines of framework glue.
llmaker removes that tax. One CLI provisions the entire stack on a private network and operates it as a single fleet — live status, logs, and a resource dashboard across every model and service. Stacks are declarative and reconcilable (apply --prune), models are OpenAI-compatible, and retrieval is traced out of the box. From a single model to a complete application:
── Build a complete application stack ──────────────────────────
llmaker stack up assistant    # one command → a private ChatGPT-style UI over a local model
llmaker stack init rag        # …or scaffold any stack to edit, then apply it:
llmaker apply                 # assistant · voice · rag · research · code · chatbot · faq · recommend · sql
── …or run a single model (OpenAI-compatible) ──────────────────
llmaker up --model llama3:8b  # a local endpoint; explicit, or via a preset:
llmaker up chat               # chat · code · small · embed · vision
llmaker chat                  # test it in the terminal
llmaker open                  # open its built-in web UI
── …or compose the stack à la carte, service by service ────────
llmaker service catalog       # browse what's available
llmaker service add qdrant    # vector database → qdrant:6333
llmaker service add redis     # cache / memory → redis:6379
llmaker service add langfuse  # observability → langfuse:3000
── Operate the fleet ───────────────────────────────────────────
llmaker ls                      # every model + service, one view (--json)
llmaker top                     # live resource dashboard (TUI)
llmaker status                  # gauges, loaded models, endpoints
llmaker logs -f                 # stream logs from any container
llmaker pull mistral --on chat  # download a model with progress
llmaker stop / start / rm       # lifecycle management
── Consume it — the agent's API, or any OpenAI client ──────────
AGENT=$(llmaker service ls --json | jq -r '.[]|select(.service=="agent").url')
curl "$AGENT/api/ingest" -F [email protected]  # add knowledge
curl "$AGENT/api/chat" -d '{"question":"refund policy?"}'  # grounded answer + sources
curl "$AGENT/api/recommend" -d '{"like":["sku1","sku2"]}'  # semantic recommendations
Everything lands on a private network where each container discovers the others by name — no Compose file and no glue code.
Highlights
The complete stack, curated. Models and the infrastructure around them: vector databases (Qdrant, Chroma, pgvector, Weaviate), Redis, embeddings, Open WebUI, n8n, Flowise, Whisper, and Langfuse, all from one versioned catalog.
Automatic service discovery. Every model and service joins a private Docker network and resolves by name. Your application reaches chat:8080 and qdrant:6333 with zero IP wiring.
A retrieval & tool agent, built in. A FastAPI + LangGraph service: rewrite → retrieve → rerank → generate (multi-turn, MMR), a tool-calling loop (calculator, knowledge base, self-hosted web search, SQL), and a semantic recommendation API.
Observability by default. Every instance exposes Prometheus /metrics (requests, tokens/sec, CPU/RAM/GPU) for scraping, and the RAG stack ships Langfuse, so every query is traced (retrieval hits and scores, model and token usage) with no setup.
Measurable quality. An evaluation harness (/api/eval) grades answers for groundedness, relevance, and correctness with an LLM judge, giving you retrieval quality you can track across changes rather than guess at.
More than RAG. First-class endpoints for summarization (map-reduce over long docs), structured JSON extraction, and speech-to-text (Whisper), plus optional Redis-backed conversation memory.
Declarative, reconcilable. Define your stack in one file. llmaker apply brings it to the desired state in dependency order; --prune removes what's no longer declared.
OpenAI-compatible. Each model exposes a stable /v1/* API (chat, completions, embeddings, streaming) behind one contract. Ollama runs it today, with a llama.cpp backend in progress.
Private by design. Containers bind to 127.0.0.1 by default. Your documents, embeddings, and traces never leave your infrastructure. No per-token cost, no vendor lock-in.
Operable. A single static Go binary, a labeled-container model with no state file to drift, --json output everywhere, and a live top dashboard.
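The "Declarative, reconcilable" behaviour described above can be pictured as a diff between what is declared and what is running, walked in dependency order. The sketch below is only an illustration of that idea, not llmaker's implementation (which the README does not show); the `plan` function, the dependency map, and the service names in the example are all hypothetical.

```python
from graphlib import TopologicalSorter


def plan(desired, running, prune=False):
    """Return (to_create, to_remove) for a declared stack.

    desired: dict mapping each declared service to the set of services it depends on.
    running: set of service names that currently exist.
    """
    # static_order() yields dependencies before the services that need them.
    order = list(TopologicalSorter(desired).static_order())
    to_create = [name for name in order if name not in running]
    # Without prune, services that are running but undeclared are left alone.
    to_remove = sorted(running - set(order)) if prune else []
    return to_create, to_remove


# Hypothetical stack: the agent depends on a vector store and a model.
desired = {"agent": {"qdrant", "chat"}, "qdrant": set(), "chat": set()}
running = {"chat", "redis"}

to_create, to_remove = plan(desired, running, prune=True)
print(to_create)  # qdrant comes before agent, which depends on it
print(to_remove)  # only populated because prune=True
```

Calling `plan(desired, running)` without `prune=True` returns an empty removal list, which mirrors the idea that pruning is opt-in.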
Why self-host your LLM stack?
Data ownership. Proprietary documents, customer data, and prompts stay on hardware you control. Nothing is sent to a third-party API.
No assembly tax. The vector DB, embeddings, cache, agent, and tracing come pre-integrated and networked — not as a Compose file you maintain by hand.
Predictable cost. Inference and retrieval run on infrastructure you already pay for. No per-token billing, no rate limits.
Portability. The same stack.yaml runs on a laptop, a CI runner, or a server. Swap the model or the vector database without touching your application.
| | Model runners (Ollama, LM Studio) | DIY Docker Compose | Frameworks (LangChain) | llmaker |
| --- | --- | --- | --- | --- |
| Run local models, OpenAI-compatible | ✓ | - | - | ✓ |
| Vector DB, embeddings, cache (curated) | - | manual | - | ✓ |
| Service discovery between containers | - | manual | n/a | ✓ |
| One-command application (RAG, recsys) | - | - | - | ✓ |
| Built-in retrieval & recommendation agent | - | - | you code it | ✓ |
| Observability / tracing integrated | - | manual | manual | ✓ |
| Declarative provisioning & reconciliation | - | partial | - | ✓ |
Installation
Requires Docker. Run llmaker doctor afterward to validate your environment.
Prebuilt binary (Linux / macOS)
curl -fsSL https://raw.githubusercontent.com/raiyanyahya/llmaker/master/scripts/install.sh | sh
Go toolchain
go install github.com/raiyanyahya/llmaker/cmd/llmaker@latest
From source
git clone https://github.com/raiyanyahya/llmaker && cd llmaker && make build
Homebrew and winget packages are on the roadmap. The agent image is built locally with make image-agent until it is published to a registry.
Quickstart
Provision and run a complete retrieval-augmented generation stack:
llmaker stack up assistant   # scaffold + apply in one step (assistant needs no agent image)
llmaker stack init rag       # generate stack.yaml (assistant | voice | rag | research | code | chatbot | faq | recommend | sql)
make image-agent             # build the agent image once (stacks that include the agent)
llmaker apply -f stack.yaml  # provision the stack: model + services, networked
llmaker ls                   # inspect models and services in one view
Resolve the agent endpoint and use it:
AGENT=$(llmaker service ls --json | jq -r '.[] | select(.service=="agent").url')
curl "$AGENT/api/ingest" -F [email protected]  # ingest documents
curl "$AGENT/api/chat" -d '{"question":"…","history":[],"top_k":4}'  # query, with sources
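The jq filter above picks the entry whose `service` field is `agent` and reads its `url`. The same selection can be sketched in Python, assuming only what the filter implies: that `llmaker service ls --json` prints a JSON array of objects with `service` and `url` fields. The helper name `agent_url` and the sample listing (including its URLs) are made up for illustration.

```python
import json


def agent_url(listing_json, service="agent"):
    """Return the url of the first entry whose "service" field matches."""
    for entry in json.loads(listing_json):
        if entry.get("service") == service:
            return entry["url"]
    raise LookupError(f"no service named {service!r} in listing")


# Made-up sample standing in for `llmaker service ls --json` output.
sample = json.dumps([
    {"service": "qdrant", "url": "http://127.0.0.1:6333"},
    {"service": "agent", "url": "http://127.0.0.1:8000"},
])

print(agent_url(sample))
```

In a real script, `listing_json` would likely come from `subprocess.run(["llmaker", "service", "ls", "--json"], capture_output=True, text=True).stdout`.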
llmaker also runs individual models — the easiest way to expose a local, OpenAI-compatible endpoint:
llmaker up --model llama3:8b # provision a model instance
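from openai import OpenAI

# Point the standard OpenAI client at the local endpoint; the key is a placeholder.
client = OpenAI(base_url="http://127.0.0.1:11500/v1", api_key="not-needed")
response = client.chat.completions.create(model="llama3:8b", messages=[{"role": "user", "content": "Hello"}])
print(response.choices[0].message.content)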
Stacks
A stack is a model plus the services around it, provisioned together. Scaffold and run one in a single step with llmaker stack up <template>, or generate a stack.yaml to edit with llmaker stack init <template> and apply it with llmaker apply.
| Template | Application | Components |
| --- | --- | --- |
| assistant | A private, ChatGPT-style assistant over a local model: chats, prompts, RAG in the UI. No agent image to build | LLM · Open WebUI |
| voice | Talk to a model, with speech-to-text in the browser via self-hosted Whisper. No agent image to build | LLM · Open WebUI · Whisper |
| rag | Document Q&A: ingest files, query with grounded answers and sources, fully traced | LLM · Qdrant · embeddings · agent · Langfuse · Postgres |
| research | A tool-using assistant that searches the live web and your documents, then synthesizes | LLM · SearXNG · Qdrant · embeddings · agent |
| code | A code assistant: ingest a repo, ask grounded questions and review code | LLM · Qdrant · embeddings · agent |
| chatbot | A multi-turn assistant with a web UI and per-session memory | LLM · Redis · agent |
| faq | A knowledge-base assistant tuned for short, grounded answers | LLM · Qdrant · embeddings · agent |
| recommend | A semantic recommendation engine ("more like this"), no LLM required | Qdrant · embeddings · agent |
| sql | Ask your database in plain English: the agent runs read-only SQL (enforced) and grounds in docs | LLM · Postgres · Qdrant · embeddings · agent |
The agent
The catalog's agent is a FastAPI + LangGraph service (agent/) that turns a bare model and vector store into an application. It is a standard service on the network, configured by environment to discover the others by name.
Retrieval as an explicit graph — rewrite → retrieve → rerank → generate:
rewrite — collapses multi-turn history into a standalone query, so follow-ups that depend on context ("and when was it released?") resolve correctly. The model is only invoked when there is history to resolve.
retrieve — embeds the query and retrieves a candidate set from the vector store.
rerank — applies Maximal Marginal Relevance for relevant, non-redundant context.
generate — produces the answer from that context and the conversation.
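The rerank step names Maximal Marginal Relevance. As a rough illustration of what that technique does, here is the textbook greedy formulation over plain Python lists. It is not taken from the agent's code; the function names, the `lam` weighting parameter, and the toy vectors are all assumptions made for this sketch.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(query, candidates, k, lam=0.5):
    """Pick k candidate indices, trading relevance against redundancy.

    lam=1.0 ranks purely by similarity to the query; lower values penalise
    candidates that resemble ones already selected.
    """
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(query, candidates[i])
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance - (1 - lam) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 0.0]
candidates = [
    [0.9, 0.1],    # relevant
    [0.9, 0.12],   # near-duplicate of the first
    [0.5, -0.8],   # less relevant, but different
]
picked = mmr(query, candidates, k=2, lam=0.5)
print(picked)  # the near-duplicate is skipped in favour of the distinct vector
```

With `lam=1.0` the same call returns the two most similar vectors regardless of overlap, which is the plain top-k behaviour that MMR is meant to improve on.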
POST /api/ingest      multipart file or text                       → chunk, embed, store
POST /api/chat        { question, history?, top_k?, session_id? }  → answer + sources
POST /api/agent       { question, history?, session_id? }          → tool-using answer + tool calls
POST /api/summarize   { text, instructions?, max_words? }          → summary (map-reduce for long text)
POST /api/extract     { text, fields: { name: description } }      → JSON with exactly those keys
POST /api/transcribe  multipart audio file                         → { text } (needs a whisper service)
POST /api/eval        { cases: [{ question, reference? }] }        → graded answers + summary
POST /api/items       { items: [{ id, text, metadata? }] }         → index for recommendation
POST /api/recommend   { query } or { like: [id, …] }               → ranked items
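For callers that prefer Python over curl, a request to the chat endpoint might be assembled as follows, using only the standard library. The field names (`question`, `history`, `top_k`, `session_id`) come from the endpoint listing above; the helper name `chat_request`, the base URL, and the JSON content type are assumptions, and the response shape is not parsed because the README does not spell out its keys.

```python
import json
import urllib.request


def chat_request(base_url, question, history=None, top_k=None, session_id=None):
    """Build (but do not send) a POST request for the agent's /api/chat."""
    payload = {"question": question}
    # Optional fields are included only when the caller sets them.
    if history is not None:
        payload["history"] = history
    if top_k is not None:
        payload["top_k"] = top_k
    if session_id is not None:
        payload["session_id"] = session_id
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},  # assumed content type
        method="POST",
    )


# Placeholder base URL; resolve the real one with `llmaker service ls --json`.
req = chat_request("http://127.0.0.1:8000", "refund policy?", top_k=4)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Once a stack with the agent is running, passing `req` to `urllib.request.urlopen` would send it. Building the request separately keeps the payload easy to inspect and test without a live service.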
Tool calling. Beyond retrieval, /api/agent runs a tool-calling loop where the model decides which tools to invoke — a calculator, the knowledge base (retrieval as a tool), the current time, a self-hosted web search (SearXNG, no paid API), and an optional read-only SQL tool over your database — and the loop executes them until it has an answer. The response includes every tool call it made. Adding a tool is one entry in agent/app/tools.py.
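The loop described above (the model requests a tool, the loop runs it and feeds the result back, and this repeats until the model answers) can be sketched in a few lines. This is a toy under stated assumptions, not the contents of agent/app/tools.py: the registry layout, the `fake_model` stand-in for the LLM, the decision format, and the `add` tool are all hypothetical.

```python
from datetime import datetime, timezone

# Hypothetical registry: one entry per tool, name -> callable.
TOOLS = {
    "add": lambda a, b: a + b,
    "current_time": lambda: datetime.now(timezone.utc).isoformat(),
}


def fake_model(question, observations):
    """Scripted stand-in for an LLM: request one tool call, then answer."""
    if not observations:
        return {"tool": "add", "args": {"a": 19, "b": 23}}
    return {"answer": f"The result is {observations[-1]}."}


def run_agent(question, model, tools, max_steps=5):
    """Execute requested tools until the model answers or steps run out."""
    observations, calls = [], []
    for _ in range(max_steps):
        decision = model(question, observations)
        if "answer" in decision:
            return {"answer": decision["answer"], "tool_calls": calls}
        result = tools[decision["tool"]](**decision["args"])
        calls.append({"tool": decision["tool"], "args": decision["args"], "result": result})
        observations.append(result)
    # Step budget exhausted without a final answer.
    return {"answer": None, "tool_calls": calls}


result = run_agent("What is 19 + 23?", fake_model, TOOLS)
print(result["answer"])
print(result["tool_calls"])
```

Returning the list of calls alongside the answer matches the README's note that the response includes every tool call made; the `max_steps` cap is an added safeguard in this sketch so a model that never answers cannot loop forever.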
Tracing. The rag stack provisions Langfuse and the agent traces every query to it, with zero configuration — each request (RAG or tool-using) appears as a trace with its retrieval, tool, and generation
[truncated for AI cost control]