Self-hosting the modern AI stack could be the way forward

llmaker is an open-source platform that lets you run the complete modern LLM stack on your own infrastructure — large language models, vector databases, embeddings, caching, observability, and a built-in retrieval & agent layer — provisioned, networked, and production-shaped from a single command.

Source: Hacker News AI · Author: sleepynoodle

Self-host the modern LLM stack.

Build private retrieval-augmented chatbots, FAQ assistants, and recommendation engines locally. No third-party API keys. No data leaving your machine.

Quickstart · Why llmaker · Stacks · The agent · Architecture · CLI · Roadmap

Overview

Running a model locally is easy. Shipping an application is not. A production retrieval system needs a vector database, an embeddings service, a caching layer, an orchestration layer, and observability — each containerized, networked, and configured to discover the others. Assembling that is a recurring tax: a sprawl of docker run flags, a brittle Compose file, and hundreds of lines of framework glue.

llmaker removes that tax. One CLI provisions the entire stack on a private network and operates it as a single fleet — live status, logs, and a resource dashboard across every model and service. Stacks are declarative and reconcilable (apply --prune), models are OpenAI-compatible, and retrieval is traced out of the box. From a single model to a complete application:

── Build a complete application stack ──────────────────────────

llmaker stack up assistant   # one command → a private ChatGPT-style UI over a local model
llmaker stack init rag       # …or scaffold any stack to edit, then apply it:
llmaker apply                #   assistant · voice · rag · research · code · chatbot · faq · recommend · sql

── …or run a single model (OpenAI-compatible) ──────────────────

llmaker up --model llama3:8b   # a local endpoint: explicit, or a preset:
llmaker up chat                #   chat · code · small · embed · vision
llmaker chat                   # test it in the terminal
llmaker open                   # open its built-in web UI

── …or compose the stack à la carte, service by service ────────

llmaker service catalog        # browse what's available
llmaker service add qdrant     # vector database → qdrant:6333
llmaker service add redis      # cache / memory → redis:6379
llmaker service add langfuse   # observability → langfuse:3000

── Operate the fleet ───────────────────────────────────────────

llmaker ls                       # every model + service, one view (--json)
llmaker top                      # live resource dashboard (TUI)
llmaker status                   # gauges, loaded models, endpoints
llmaker logs -f                  # stream logs from any container
llmaker pull mistral --on chat   # download a model with progress
llmaker stop / start / rm        # lifecycle management

── Consume it — the agent's API, or any OpenAI client ──────────

AGENT=$(llmaker service ls --json | jq -r '.[]|select(.service=="agent").url')
curl "$AGENT/api/ingest" -F [email protected]                    # add knowledge
curl "$AGENT/api/chat" -d '{"question":"refund policy?"}'       # grounded answer + sources
curl "$AGENT/api/recommend" -d '{"like":["sku1","sku2"]}'       # semantic recommendations

Everything lands on a private network where each container discovers the others by name — no Compose file and no glue code.

Highlights

The complete stack, curated. Models and the infrastructure around them, from one versioned catalog: vector databases (Qdrant, Chroma, pgvector, Weaviate), Redis, embeddings, Open WebUI, n8n, Flowise, Whisper, and Langfuse.

Automatic service discovery. Every model and service joins a private Docker network and resolves by name. Your application reaches chat:8080 and qdrant:6333 with zero IP wiring.

A retrieval & tool agent, built in. A FastAPI + LangGraph service: rewrite → retrieve → rerank → generate (multi-turn, MMR), a tool-calling loop (calculator, knowledge base, self-hosted web search, SQL), and a semantic recommendation API.

Observability by default. Every instance exposes Prometheus /metrics (requests, tokens/sec, CPU/RAM/GPU) for scraping, and the RAG stack ships Langfuse: every query is traced (retrieval hits and scores, model and token usage) with no setup.

Measurable quality. An evaluation harness (/api/eval) grades answers for groundedness, relevance, and correctness with an LLM judge, so retrieval quality can be tracked across changes rather than guessed at.

More than RAG. First-class endpoints for summarization (map-reduce over long docs), structured JSON extraction, and speech-to-text (Whisper), plus optional Redis-backed conversation memory.

Declarative, reconcilable. Define your stack in one file. llmaker apply brings it to the desired state in dependency order; --prune removes what's no longer declared.

OpenAI-compatible. Each model exposes a stable /v1/* API (chat, completions, embeddings, streaming) behind one contract. Ollama runs it today, with a llama.cpp backend in progress.

Private by design. Containers bind to 127.0.0.1 by default. Your documents, embeddings, and traces never leave your infrastructure. No per-token cost, no vendor lock-in.

Operable. A single static Go binary, a labeled-container model with no state file to drift, --json output everywhere, and a live top dashboard.
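The "declarative, reconcilable" behaviour can be illustrated with a small planning function. This is only a sketch of the general idea, not llmaker's actual implementation (the real tool is written in Go): the `plan` function, the action tuples, and the service name `llm` are hypothetical, and the stack is assumed to be a mapping from each service to the services it depends on.

```python
from graphlib import TopologicalSorter


def plan(desired, running, prune=False):
    """Return ordered actions that move `running` toward `desired`.

    desired: dict mapping a service name to the names it depends on (assumed shape)
    running: set of service names that currently exist
    """
    # static_order() yields every dependency before the services that need it.
    order = TopologicalSorter(desired).static_order()
    actions = [("create", name) for name in order if name not in running]
    if prune:
        # Anything that is running but no longer declared gets removed.
        undeclared = set(running) - set(desired)
        actions += [("remove", name) for name in sorted(undeclared)]
    return actions


# Hypothetical stack: the agent depends on a model and a vector database.
stack = {"agent": ["llm", "qdrant"], "llm": [], "qdrant": []}
print(plan(stack, running={"llm", "redis"}, prune=True))
# [('create', 'qdrant'), ('create', 'agent'), ('remove', 'redis')]
```

Because the plan is computed from the declared state and what is actually running, applying it twice produces no further actions the second time.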

Why self-host your LLM stack?

Data ownership. Proprietary documents, customer data, and prompts stay on hardware you control. Nothing is sent to a third-party API.

No assembly tax. The vector DB, embeddings, cache, agent, and tracing come pre-integrated and networked — not as a Compose file you maintain by hand.

Predictable cost. Inference and retrieval run on infrastructure you already pay for. No per-token billing, no rate limits.

Portability. The same stack.yaml runs on a laptop, a CI runner, or a server. Swap the model or the vector database without touching your application.

| Capability | Model runners (Ollama, LM Studio) | DIY Docker Compose | Frameworks (LangChain) | llmaker |
| --- | --- | --- | --- | --- |
| Run local models, OpenAI-compatible | ✓ | - | - | ✓ |
| Vector DB, embeddings, cache (curated) | - | manual | - | ✓ |
| Service discovery between containers | - | manual | n/a | ✓ |
| One-command application (RAG, recsys) | - | - | - | ✓ |
| Built-in retrieval & recommendation agent | - | - | you code it | ✓ |
| Observability / tracing integrated | - | manual | manual | ✓ |
| Declarative provisioning & reconciliation | - | partial | - | ✓ |

Installation

Requires Docker. Run llmaker doctor afterward to validate your environment.

Prebuilt binary (Linux / macOS)

curl -fsSL https://raw.githubusercontent.com/raiyanyahya/llmaker/master/scripts/install.sh | sh

Go toolchain

go install github.com/raiyanyahya/llmaker/cmd/llmaker@latest

From source

git clone https://github.com/raiyanyahya/llmaker && cd llmaker && make build

Homebrew and winget packages are on the roadmap. The agent image is built locally with make image-agent until it is published to a registry.

Quickstart

Provision and run a complete retrieval-augmented generation stack:

llmaker stack up assistant    # scaffold + apply in one step (assistant needs no agent image)
llmaker stack init rag        # generate stack.yaml (assistant | voice | rag | research | code | chatbot | faq | recommend | sql)
make image-agent              # build the agent image once (stacks that include the agent)
llmaker apply -f stack.yaml   # provision the stack: model + services, networked
llmaker ls                    # inspect models and services in one view

Resolve the agent endpoint and use it:

AGENT=$(llmaker service ls --json | jq -r '.[] | select(.service=="agent").url')

curl "$AGENT/api/ingest" -F [email protected]                               # ingest documents
curl "$AGENT/api/chat" -d '{"question":"…","history":[],"top_k":4}'       # query, with sources
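For applications that call the agent from code rather than curl, the chat request might be built as follows. This is a minimal sketch using only the standard library: the helper name `chat_request` and the example address are hypothetical, the JSON `Content-Type` header is an assumption (the curl examples do not set one), and the payload fields `question`, `history`, and `top_k` are the ones shown above.

```python
import json
import urllib.request


def chat_request(agent_url, question, history=None, top_k=4):
    """Build (but do not send) a POST request for the agent's /api/chat endpoint."""
    payload = {"question": question, "history": history or [], "top_k": top_k}
    return urllib.request.Request(
        url=agent_url.rstrip("/") + "/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        # Assumption: the agent accepts a JSON body with this content type.
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical address; in practice use the URL from `llmaker service ls --json`.
req = chat_request("http://127.0.0.1:8000", "refund policy?")
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would send it. The shape of the response body is not specified here beyond "answer + sources", so the sketch stops short of parsing it.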

llmaker also runs individual models — the easiest way to expose a local, OpenAI-compatible endpoint:

llmaker up --model llama3:8b # provision a model instance

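from openai import OpenAI
client = OpenAI(base_url="http://127.0.0.1:11500/v1", api_key="not-needed")  # local endpoint; the key is a placeholder
response = client.chat.completions.create(model="llama3:8b", messages=[{"role": "user", "content": "Hello"}])
print(response.choices[0].message.content)  # the model's reply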

Stacks

A stack is a model plus the services around it, provisioned together. Pass a template name to llmaker stack up to scaffold and run a stack in a single step (for example, llmaker stack up assistant), or to llmaker stack init to generate a stack.yaml to edit (for example, llmaker stack init rag), then apply it with llmaker apply.

| Template | Application | Components |
| --- | --- | --- |
| assistant | A private, ChatGPT-style assistant over a local model: chats, prompts, RAG in the UI. No agent image to build | LLM · Open WebUI |
| voice | Talk to a model, with speech-to-text in the browser via self-hosted Whisper. No agent image to build | LLM · Open WebUI · Whisper |
| rag | Document Q&A: ingest files, query with grounded answers and sources, fully traced | LLM · Qdrant · embeddings · agent · Langfuse · Postgres |
| research | A tool-using assistant that searches the live web and your documents, then synthesizes | LLM · SearXNG · Qdrant · embeddings · agent |
| code | A code assistant: ingest a repo, ask grounded questions and review code | LLM · Qdrant · embeddings · agent |
| chatbot | A multi-turn assistant with a web UI and per-session memory | LLM · Redis · agent |
| faq | A knowledge-base assistant tuned for short, grounded answers | LLM · Qdrant · embeddings · agent |
| recommend | A semantic recommendation engine ("more like this"), no LLM required | Qdrant · embeddings · agent |
| sql | Ask your database in plain English: the agent runs read-only SQL (enforced) and grounds in docs | LLM · Postgres · Qdrant · embeddings · agent |

The agent

The catalog's agent is a FastAPI + LangGraph service (agent/) that turns a bare model and vector store into an application. It is a standard service on the network, configured by environment to discover the others by name.

Retrieval as an explicit graph — rewrite → retrieve → rerank → generate:

rewrite — collapses multi-turn history into a standalone query, so follow-ups that depend on context ("and when was it released?") resolve correctly. The model is only invoked when there is history to resolve.

retrieve — embeds the query and retrieves a candidate set from the vector store.

rerank — applies Maximal Marginal Relevance for relevant, non-redundant context.

generate — produces the answer from that context and the conversation.
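The rerank step above names Maximal Marginal Relevance. A minimal pure-Python sketch of the standard MMR selection loop is shown here to make the trade-off concrete. It is not taken from the agent's code: the function names, the `lam` weighting parameter, and the use of cosine similarity over plain lists are assumptions for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(query, candidates, k=4, lam=0.5):
    """Return indices of up to k candidates chosen by Maximal Marginal Relevance.

    lam weights relevance to the query against redundancy with already-selected
    items: 1.0 means pure relevance, 0.0 means pure diversity.
    """
    remaining = list(range(len(candidates)))
    selected = []
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(query, candidates[i])
            # Highest similarity to anything already picked (0 if nothing is picked).
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With `lam=1.0` the redundancy term vanishes and the function reduces to plain similarity ranking; lower values increasingly favour chunks that differ from those already chosen.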

POST /api/ingest       multipart file or text                        → chunk, embed, store
POST /api/chat         { question, history?, top_k?, session_id? }   → answer + sources
POST /api/agent        { question, history?, session_id? }           → tool-using answer + tool calls
POST /api/summarize    { text, instructions?, max_words? }           → summary (map-reduce for long text)
POST /api/extract      { text, fields: { name: description } }       → JSON with exactly those keys
POST /api/transcribe   multipart audio file                          → { text } (needs a whisper service)
POST /api/eval         { cases: [{ question, reference? }] }         → graded answers + summary
POST /api/items        { items: [{ id, text, metadata? }] }          → index for recommendation
POST /api/recommend    { query } or { like: [id, …] }                → ranked items
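The summarize endpoint is described as map-reduce for long text. The general pattern can be sketched as below, assuming a `summarize` callable that wraps a model call. The function names, the word-based splitting, and the `chunk_size` parameter are hypothetical and not drawn from the agent's source.

```python
from typing import Callable, List


def split_words(text: str, chunk_size: int) -> List[str]:
    """Split text into consecutive pieces of at most chunk_size words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def map_reduce_summarize(text: str, summarize: Callable[[str], str], chunk_size: int = 500) -> str:
    """Summarize each chunk (map), then summarize the joined partial summaries (reduce)."""
    chunks = split_words(text, chunk_size)
    if len(chunks) <= 1:
        return summarize(text)  # short input: a single call is enough
    partials = [summarize(chunk) for chunk in chunks]  # map step
    combined = "\n".join(partials)
    if len(combined.split()) >= len(text.split()):
        # The partial summaries did not shrink the text; stop recursing.
        return summarize(combined)
    # Reduce step: recurse in case the combined summaries are still too long.
    return map_reduce_summarize(combined, summarize, chunk_size)
```

Injecting `summarize` keeps the sketch independent of any particular model client; in practice it would be a function that sends a prompt to the local OpenAI-compatible endpoint.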

Tool calling. Beyond retrieval, /api/agent runs a tool-calling loop where the model decides which tools to invoke — a calculator, the knowledge base (retrieval as a tool), the current time, a self-hosted web search (SearXNG, no paid API), and an optional read-only SQL tool over your database — and the loop executes them until it has an answer. The response includes every tool call it made. Adding a tool is one entry in agent/app/tools.py.
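The loop described above can be sketched in a few lines. This is an illustrative outline under stated assumptions, not the contents of agent/app/tools.py: the `model` callable, its return shape (`{"tool", "args"}` or `{"answer"}`), the `max_steps` cap, and the response keys are all hypothetical, and the calculator is a deliberately small arithmetic evaluator.

```python
import ast
import operator

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}


def calculator(expression: str) -> str:
    """Evaluate +, -, *, / over numeric literals without using eval()."""
    def ev(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        raise ValueError("unsupported expression")
    return str(ev(ast.parse(expression, mode="eval").body))


def run_agent(question, model, tools, max_steps=5):
    """Ask the model what to do, run the chosen tool, and repeat until it answers.

    model(messages) is assumed to return either {"tool": name, "args": {...}}
    or {"answer": text}; a real model client would be adapted to this shape.
    """
    messages = [{"role": "user", "content": question}]
    calls = []
    for _ in range(max_steps):
        step = model(messages)
        if "answer" in step:
            return {"answer": step["answer"], "tool_calls": calls}
        result = tools[step["tool"]](**step["args"])
        calls.append({"tool": step["tool"], "args": step["args"], "result": result})
        # Feed the tool output back so the model can use it on the next turn.
        messages.append({"role": "tool", "content": result})
    return {"answer": None, "tool_calls": calls}  # step budget exhausted
```

In this shape, registering a tool amounts to adding one entry to the `tools` dictionary passed to `run_agent`, for example `{"calculator": calculator}`.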

Tracing. The rag stack provisions Langfuse and the agent traces every query to it, with zero configuration — each request (RAG or tool-using) appears as a trace with its retrieval, tool, and generation
