2026-06-29 23:58 UTC · In-site rewrite · 5 min read · Updated: 2026-06-30 00:25 UTC

Self-hosting the modern AI stack could be the way forward

llmaker is an open-source platform that lets you run the complete modern LLM stack on your own infrastructure — large language models, vector databases, embeddings, caching, observability, and a built-in retrieval & agent layer — provisioned, networked, and production-shaped from a single command.

Source: Hacker News AI · Author: sleepynoodle


Self-host the modern LLM stack.


Build private retrieval-augmented chatbots, FAQ assistants, and recommendation engines locally. No third-party API keys. No data leaving your machine.

Quickstart · Why llmaker · Stacks · The agent · Architecture · CLI · Roadmap

Overview

Running a model locally is easy. Shipping an application is not. A production retrieval system needs a vector database, an embeddings service, a caching layer, an orchestration layer, and observability — each containerized, networked, and configured to discover the others. Assembling that is a recurring tax: a sprawl of docker run flags, a brittle Compose file, and hundreds of lines of framework glue.

llmaker removes that tax. One CLI provisions the entire stack on a private network and operates it as a single fleet — live status, logs, and a resource dashboard across every model and service. Stacks are declarative and reconcilable (apply --prune), models are OpenAI-compatible, and retrieval is traced out of the box. From a single model to a complete application:

── Build a complete application stack ──────────────────────────

llmaker stack up assistant   # one command → a private ChatGPT-style UI over a local model
llmaker stack init rag       # …or scaffold any stack to edit, then apply it:
llmaker apply                # assistant · voice · rag · research · code · chatbot · faq · recommend · sql

── …or run a single model (OpenAI-compatible) ──────────────────

llmaker up --model llama3:8b   # a local endpoint, explicit or via a preset:
llmaker up chat                # chat · code · small · embed · vision
llmaker chat                   # test it in the terminal
llmaker open                   # open its built-in web UI

── …or compose the stack à la carte, service by service ────────

llmaker service catalog        # browse what's available
llmaker service add qdrant     # vector database → qdrant:6333
llmaker service add redis      # cache / memory → redis:6379
llmaker service add langfuse   # observability → langfuse:3000

── Operate the fleet ───────────────────────────────────────────

llmaker ls                       # every model + service, one view (--json)
llmaker top                      # live resource dashboard (TUI)
llmaker status                   # gauges, loaded models, endpoints
llmaker logs -f                  # stream logs from any container
llmaker pull mistral --on chat   # download a model with progress
llmaker stop / start / rm        # lifecycle management

── Consume it — the agent's API, or any OpenAI client ──────────

AGENT=$(llmaker service ls --json | jq -r '.[]|select(.service=="agent").url')
curl "$AGENT/api/ingest" -F [email protected]                   # add knowledge
curl "$AGENT/api/chat" -d '{"question":"refund policy?"}'      # grounded answer + sources
curl "$AGENT/api/recommend" -d '{"like":["sku1","sku2"]}'      # semantic recommendations

Everything lands on a private network where each container discovers the others by name — no Compose file and no glue code.

Highlights

The complete stack, curated. Models and the infrastructure around them: vector databases (Qdrant, Chroma, pgvector, Weaviate), Redis, embeddings, Open WebUI, n8n, Flowise, Whisper, and Langfuse, all from one versioned catalog.

Automatic service discovery. Every model and service joins a private Docker network and resolves by name. Your application reaches chat:8080 and qdrant:6333 with zero IP wiring.

A retrieval & tool agent, built in. A FastAPI + LangGraph service: rewrite → retrieve → rerank → generate (multi-turn, MMR), a tool-calling loop (calculator, knowledge base, self-hosted web search, SQL), and a semantic recommendation API.

Observability by default. Every instance exposes Prometheus /metrics (requests, tokens/sec, CPU/RAM/GPU) for scraping, and the RAG stack ships Langfuse, so every query is traced (retrieval hits and scores, model and token usage) with no setup.

Measurable quality. An evaluation harness (/api/eval) grades answers for groundedness, relevance, and correctness with an LLM judge, so retrieval quality is something you can track across changes rather than guess at.

More than RAG. First-class endpoints for summarization (map-reduce over long docs), structured JSON extraction, and speech-to-text (Whisper), plus optional Redis-backed conversation memory.

Declarative, reconcilable. Define your stack in one file. llmaker apply brings it to the desired state in dependency order; --prune removes what's no longer declared.

OpenAI-compatible. Each model exposes a stable /v1/* API (chat, completions, embeddings, streaming) behind one contract. Ollama runs it today, with a llama.cpp backend in progress.

Private by design. Containers bind to 127.0.0.1 by default. Your documents, embeddings, and traces never leave your infrastructure. No per-token cost, no vendor lock-in.

Operable. A single static Go binary, a labeled-container model with no state file to drift, --json output everywhere, and a live top dashboard.
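The "declarative, reconcilable" behaviour described above can be pictured with a small sketch. This is not llmaker's implementation (the tool itself is a Go binary); the `plan` function, the dependency map, and the service names are hypothetical. It only illustrates the general idea: start what is declared, dependencies first, and with pruning enabled remove what is running but no longer declared.

```python
from graphlib import TopologicalSorter


def plan(desired: dict[str, set[str]], running: set[str], prune: bool = False) -> list[tuple[str, str]]:
    """Return (action, name) steps that move `running` toward `desired`.

    `desired` maps each declared service to the services it depends on.
    """
    # static_order() yields dependencies before the services that need them.
    order = list(TopologicalSorter(desired).static_order())
    steps = [("start", name) for name in order if name not in running]
    if prune:
        # Anything running but not declared is removed, in a stable order.
        steps += [("remove", name) for name in sorted(running - set(order))]
    return steps


# Hypothetical stack: the agent depends on a model ("chat") and a vector store.
declared = {"agent": {"chat", "qdrant"}, "chat": set(), "qdrant": set()}
for action, name in plan(declared, running={"chat", "redis"}, prune=True):
    print(action, name)
```

Without `prune=True` the same call only produces start steps, which mirrors the distinction the README draws between apply and apply --prune.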

Why self-host your LLM stack?

Data ownership. Proprietary documents, customer data, and prompts stay on hardware you control. Nothing is sent to a third-party API.

No assembly tax. The vector DB, embeddings, cache, agent, and tracing come pre-integrated and networked — not as a Compose file you maintain by hand.

Predictable cost. Inference and retrieval run on infrastructure you already pay for. No per-token billing, no rate limits.

Portability. The same stack.yaml runs on a laptop, a CI runner, or a server. Swap the model or the vector database without touching your application.

| Capability | Model runners (Ollama, LM Studio) | DIY Docker Compose | Frameworks (LangChain) | llmaker |
| --- | --- | --- | --- | --- |
| Run local models, OpenAI-compatible | ✓ | - | - | ✓ |
| Vector DB, embeddings, cache (curated) | - | manual | - | ✓ |
| Service discovery between containers | - | manual | n/a | ✓ |
| One-command application (RAG, recsys) | - | - | - | ✓ |
| Built-in retrieval & recommendation agent | - | - | you code it | ✓ |
| Observability / tracing integrated | - | manual | manual | ✓ |
| Declarative provisioning & reconciliation | - | partial | - | ✓ |

Installation

Requires Docker. Run llmaker doctor afterward to validate your environment.

Prebuilt binary (Linux / macOS)

curl -fsSL https://raw.githubusercontent.com/raiyanyahya/llmaker/master/scripts/install.sh | sh

Go toolchain

go install github.com/raiyanyahya/llmaker/cmd/llmaker@latest

From source

git clone https://github.com/raiyanyahya/llmaker && cd llmaker && make build

Homebrew and winget packages are on the roadmap. The agent image is built locally with make image-agent until it is published to a registry.

Quickstart

Provision and run a complete retrieval-augmented generation stack:

llmaker stack up assistant    # scaffold + apply in one step (assistant needs no agent image)
llmaker stack init rag        # generate stack.yaml (assistant | voice | rag | research | code | chatbot | faq | recommend | sql)
make image-agent              # build the agent image once (stacks that include the agent)
llmaker apply -f stack.yaml   # provision the stack: model + services, networked
llmaker ls                    # inspect models and services in one view

Resolve the agent endpoint and use it:

AGENT=$(llmaker service ls --json | jq -r '.[] | select(.service=="agent").url')

curl "$AGENT/api/ingest" -F [email protected]                                # ingest documents
curl "$AGENT/api/chat" -d '{"question":"…","history":[],"top_k":4}'         # query, with sources
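The same chat call can be made from Python with only the standard library. The sketch below is a minimal illustration, not an official client: the base URL is a placeholder for whatever `llmaker service ls --json` reports, and sending the body with a `Content-Type: application/json` header is an assumption, since the curl examples do not set one. The request fields mirror the curl example (`question`, `history`, `top_k`).

```python
import json
from urllib import request


def chat_request(agent_url: str, question: str, history=None, top_k: int = 4) -> request.Request:
    """Build a POST to /api/chat mirroring the curl example's JSON fields."""
    payload = {"question": question, "history": history or [], "top_k": top_k}
    return request.Request(
        url=f"{agent_url.rstrip('/')}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},  # assumption: JSON body expected
        method="POST",
    )


def ask(agent_url: str, question: str) -> dict:
    """Send the request and decode the JSON reply (needs a running agent)."""
    with request.urlopen(chat_request(agent_url, question), timeout=60) as resp:
        return json.load(resp)


# Placeholder URL: substitute the value resolved into $AGENT above.
req = chat_request("http://127.0.0.1:8000", "refund policy?")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Building the request separately from sending it keeps the payload easy to inspect before a stack is running; `ask` is the part that needs a live agent.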

llmaker also runs individual models — the easiest way to expose a local, OpenAI-compatible endpoint:

llmaker up --model llama3:8b # provision a model instance

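from openai import OpenAI

# The API key is only a placeholder; the local endpoint is selected via base_url.
client = OpenAI(base_url="http://127.0.0.1:11500/v1", api_key="not-needed")
response = client.chat.completions.create(model="llama3:8b", messages=[{"role": "user", "content": "Hello"}])
print(response.choices[0].message.content)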

Stacks

A stack is a model plus the services around it, provisioned together. Scaffold and run one in a single step with llmaker stack up and a template name (for example, llmaker stack up assistant), or generate a stack.yaml to edit with llmaker stack init (for example, llmaker stack init rag) and apply it with llmaker apply.

| Template | Application | Components |
| --- | --- | --- |
| assistant | A private, ChatGPT-style assistant over a local model: chats, prompts, RAG in the UI. No agent image to build | LLM · Open WebUI |
| voice | Talk to a model: speech-to-text in the browser via self-hosted Whisper. No agent image to build | LLM · Open WebUI · Whisper |
| rag | Document Q&A: ingest files, query with grounded answers and sources, fully traced | LLM · Qdrant · embeddings · agent · Langfuse · Postgres |
| research | A tool-using assistant that searches the live web and your documents, then synthesizes | LLM · SearXNG · Qdrant · embeddings · agent |
| code | A code assistant: ingest a repo, ask grounded questions and review code | LLM · Qdrant · embeddings · agent |
| chatbot | A multi-turn assistant with a web UI and per-session memory | LLM · Redis · agent |
| faq | A knowledge-base assistant tuned for short, grounded answers | LLM · Qdrant · embeddings · agent |
| recommend | A semantic recommendation engine: "more like this", no LLM required | Qdrant · embeddings · agent |
| sql | Ask your database in plain English: the agent runs read-only SQL (enforced) and grounds in docs | LLM · Postgres · Qdrant · embeddings · agent |
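The recommend template's "more like this" behaviour needs only embeddings and a vector store, which is why no LLM is listed. The README does not say how the agent ranks items, so the sketch below is one common approach and purely an assumption: average the vectors of the liked items and rank everything else by cosine similarity to that average. The two-dimensional vectors stand in for real embeddings, and the `recommend` function is hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recommend(like_ids, vectors, top_n=3):
    """Rank unseen items by similarity to the mean vector of the liked items."""
    liked = [vectors[item_id] for item_id in like_ids]
    centroid = [sum(column) / len(liked) for column in zip(*liked)]
    scored = [
        (cosine(centroid, vec), item_id)
        for item_id, vec in vectors.items()
        if item_id not in like_ids  # do not recommend what was already liked
    ]
    return [item_id for _, item_id in sorted(scored, reverse=True)[:top_n]]


# Toy catalog: sku3 points in nearly the same direction as sku1 and sku2.
catalog = {
    "sku1": [1.0, 0.0],
    "sku2": [0.9, 0.1],
    "sku3": [0.8, 0.2],
    "sku4": [0.0, 1.0],
}
print(recommend(["sku1", "sku2"], catalog, top_n=2))
```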

The agent

The catalog's agent is a FastAPI + LangGraph service (agent/) that turns a bare model and vector store into an application. It is a standard service on the network, configured by environment to discover the others by name.
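"Configured by environment to discover the others by name" might look roughly like the sketch below. The variable names `LLM_URL` and `VECTOR_URL` and the `ServiceEndpoints` class are hypothetical; the README does not list the agent's actual settings. The defaults reuse the container names and ports the README mentions (chat:8080 and qdrant:6333).

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceEndpoints:
    llm_url: str
    vector_url: str

    @classmethod
    def from_env(cls, env=None) -> "ServiceEndpoints":
        """Read endpoints from the environment, falling back to in-network names."""
        env = os.environ if env is None else env
        # LLM_URL and VECTOR_URL are hypothetical names used for illustration.
        return cls(
            llm_url=env.get("LLM_URL", "http://chat:8080"),
            vector_url=env.get("VECTOR_URL", "http://qdrant:6333"),
        )


# Passing a dict instead of os.environ keeps the example deterministic.
print(ServiceEndpoints.from_env({}))
print(ServiceEndpoints.from_env({"VECTOR_URL": "http://my-vector-db:6333"}))
```

Because the defaults are plain hostnames, the same configuration works for any container on the private network, and swapping the vector database becomes a one-variable change.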

Retrieval as an explicit graph — rewrite → retrieve → rerank → generate:

rewrite — collapses multi-turn history into a standalone query, so follow-ups that depend on context ("and when was it released?") resolve correctly. The model is only invoked when there is history to resolve.

retrieve — embeds the query and retrieves a candidate set from the vector store.

rerank — applies Maximal Marginal Relevance for relevant, non-redundant context.

generate — produces the answer from that context and the conversation.
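The rerank step above names Maximal Marginal Relevance. A minimal, dependency-free sketch of MMR follows; it is not the agent's code, and the `lam` weight, the function names, and the toy vectors are assumptions. Each round picks the candidate with the best trade-off between similarity to the query and similarity to what has already been selected.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(query, candidates, k=4, lam=0.5):
    """Pick up to k candidate indices, trading query relevance against redundancy.

    score = lam * sim(query, doc) - (1 - lam) * max sim(doc, already selected)
    """
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(query, candidates[i])
            redundancy = max((cosine(candidates[i], candidates[j]) for j in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 0.0]
docs = [[1.0, 0.0], [1.0, 0.0], [0.6, 0.8]]  # docs 0 and 1 are exact duplicates
print(mmr(query, docs, k=2, lam=0.3))  # low lam favours diversity
print(mmr(query, docs, k=2, lam=1.0))  # lam = 1.0 is plain relevance ranking
```

With a low `lam` the duplicate is skipped in favour of the less similar document; with `lam=1.0` the function degenerates to ordinary top-k by relevance.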

POST /api/ingest       multipart file or text → chunk, embed, store
POST /api/chat         { question, history?, top_k?, session_id? } → answer + sources
POST /api/agent        { question, history?, session_id? } → tool-using answer + tool calls
POST /api/summarize    { text, instructions?, max_words? } → summary (map-reduce for long text)
POST /api/extract      { text, fields: { name: description } } → JSON with exactly those keys
POST /api/transcribe   multipart audio file → { text } (needs a whisper service)
POST /api/eval         { cases: [{ question, reference? }] } → graded answers + summary
POST /api/items        { items: [{ id, text, metadata? }] } → index for recommendation
POST /api/recommend    { query } or { like: [id, …] } → ranked items
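The summarize endpoint is described as map-reduce for long text. The general pattern can be sketched as follows; this is an illustration of the technique, not the agent's implementation. The chunking by word count, the `chunk_size` value, and the stand-in summarizer are all assumptions, and in practice `summarize` would be a call to the model.

```python
from typing import Callable


def chunk_words(text: str, size: int) -> list[str]:
    """Split text into consecutive chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def map_reduce_summarize(text: str, summarize: Callable[[str], str], chunk_size: int = 200) -> str:
    """Summarize each chunk (map), then summarize the joined partial summaries (reduce)."""
    chunks = chunk_words(text, chunk_size)
    if len(chunks) <= 1:
        return summarize(text)  # short input: a single pass is enough
    partials = [summarize(chunk) for chunk in chunks]
    return summarize("\n".join(partials))


calls = []


def fake_summarize(text: str) -> str:
    # Stand-in for a model call so the sketch runs offline: keep the first word.
    calls.append(text)
    return text.split()[0].upper()


long_text = " ".join(f"w{i}" for i in range(10))
result = map_reduce_summarize(long_text, fake_summarize, chunk_size=4)
print(result, len(calls))
```

Ten words at four per chunk give three map calls plus one reduce call, which is why the summarizer is invoked four times in this toy run.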

Tool calling. Beyond retrieval, /api/agent runs a tool-calling loop where the model decides which tools to invoke — a calculator, the knowledge base (retrieval as a tool), the current time, a self-hosted web search (SearXNG, no paid API), and an optional read-only SQL tool over your database — and the loop executes them until it has an answer. The response includes every tool call it made. Adding a tool is one entry in agent/app/tools.py.
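The tool-calling loop can be sketched in a few lines. Everything here is hypothetical and kept deliberately small: the `TOOLS` registry, the step format, and the scripted stand-in for the model are assumptions and do not reflect the contents of agent/app/tools.py. The sketch shows the shape of the loop: ask the model for the next step, run the tool it names, feed the result back, and stop when it answers.

```python
import ast
import operator

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul, ast.Div: operator.truediv}


def calculator(expression: str) -> str:
    """Evaluate +, -, *, / on numeric literals without using eval()."""
    def ev(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        raise ValueError("unsupported expression")
    return str(ev(ast.parse(expression, mode="eval").body))


TOOLS = {"calculator": calculator}  # hypothetical registry: one entry per tool


def run_agent(question, model, max_steps=5):
    """Ask the model for the next step until it returns a final answer."""
    transcript, calls = [("user", question)], []
    for _ in range(max_steps):
        step = model(transcript)  # {"tool": name, "input": str} or {"answer": str}
        if "answer" in step:
            return {"answer": step["answer"], "tool_calls": calls}
        result = TOOLS[step["tool"]](step["input"])
        calls.append({"tool": step["tool"], "input": step["input"], "output": result})
        transcript.append(("tool", result))
    return {"answer": None, "tool_calls": calls}  # step budget exhausted


def scripted_model(transcript):
    # Stand-in for an LLM: call the calculator once, then answer with its result.
    if transcript[-1][0] == "user":
        return {"tool": "calculator", "input": "6 * 7"}
    return {"answer": f"The result is {transcript[-1][1]}."}


print(run_agent("What is 6 times 7?", scripted_model))
```

Returning the recorded calls alongside the answer matches the README's note that the response includes every tool call made, and the `max_steps` cap is a simple guard against a model that never stops calling tools.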

Tracing. The rag stack provisions Langfuse and the agent traces every query to it, with zero configuration — each request (RAG or tool-using) appears as a trace with its retrieval, tool, and generation
