2026-05-13 (site rewrite)

Show HN: Docker AI Stack – Deploy 8 self-hosted AI services with one command

Docker AI Stack is a complete self-hosted AI stack that includes Ollama, LiteLLM, Whisper, Kokoro, Docling, and more, and deploys with a single command. It features zero-configuration startup, automatic API key generation, fully local processing, GPU acceleration, and lightweight sub-stacks.

Article intelligence

Engineers, Advanced

Key points

Deploy 8 self-hosted AI services with one command, zero-config, auto-generated API keys
Includes local LLM, speech-to-text, text-to-speech, document parsing, and MCP tools
Services are private by default with optional auth; reverse proxy recommended for public deployments
Offers lightweight sub-stacks that need as little as ~2.5 GB of RAM, and supports NVIDIA GPU acceleration

Why it matters

This matters because it lets you deploy eight self-hosted AI services with one command, with zero configuration and auto-generated API keys.

Technical impact

May affect model selection, inference cost, product capability, and evaluation benchmarks.


Deploy a complete, self-hosted AI stack on your own server with a single command.

Zero-config: all services auto-configure on first start

Secure: Ollama, LiteLLM, and MCP Gateway generate API keys automatically

Private: audio, embeddings, and LLM inference all run locally — no data sent to third parties

Optional auth: Whisper, WhisperLive, Kokoro, Embeddings, and Docling work without API keys by default (set keys via env files for public deployments)

Lightweight stacks for lower memory requirements (as low as ~2.5 GB)

GPU acceleration via NVIDIA CUDA

Note: When using LiteLLM with external providers (e.g., OpenAI, Anthropic), your data will be sent to those providers.

Services included:

| Service | Role | Default port |
|---|---|---|
| Ollama (LLM) | Runs local LLM models (llama3, qwen, mistral, etc.) | 11434 |
| LiteLLM | AI gateway: routes requests to Ollama, OpenAI, Anthropic, and 100+ providers | 4000 |
| Embeddings | Converts text to vectors for semantic search and RAG | 8000 |
| Whisper (STT) | Transcribes spoken audio to text | 9000 |
| WhisperLive (real-time STT) | Real-time speech-to-text transcription over WebSocket | 9090 |
| Kokoro (TTS) | Converts text to natural-sounding speech | 8880 |
| MCP Gateway | Provides MCP tools (filesystem, fetch, GitHub, search, databases) to AI clients | 3000 |
| Docling | Converts documents (PDF, DOCX, etc.) to structured text/Markdown | 5001 |

Also available:

VPN: WireGuard, OpenVPN, IPsec VPN, Headscale

Architecture

Quick start

Requirements:

A Linux server (local or cloud) with Docker installed

At least 8 GB of RAM (with small models). For larger LLM models (8B+), 32 GB or more is recommended.

You can comment out services you don't need to reduce memory usage.
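The requirements above can be checked before starting anything. The following is a minimal pre-flight sketch, not part of the repository: the function names are illustrative, and the 8 GB default simply mirrors the figure stated above. Memory is rounded to the nearest GiB because a machine sold as "8 GB" usually reports slightly less in /proc/meminfo.

```shell
# Print total memory in GiB (rounded to nearest), read from a /proc/meminfo-style file.
mem_total_gb() {
  local meminfo="${1:-/proc/meminfo}"
  awk '/^MemTotal:/ { printf "%d\n", ($2 / 1048576) + 0.5 }' "$meminfo"
}

# Succeed when the host has at least <min_gb> of memory (default 8, per the requirements).
has_enough_memory() {
  local min_gb="${1:-8}" meminfo="${2:-/proc/meminfo}"
  [ "$(mem_total_gb "$meminfo")" -ge "$min_gb" ]
}

# Succeed when both the docker CLI and the compose plugin are available.
has_docker_compose() {
  command -v docker >/dev/null 2>&1 && docker compose version >/dev/null 2>&1
}
```

A typical use would be `has_docker_compose && has_enough_memory 8 && docker compose up -d`, raising the threshold to 32 when planning to run 8B+ models.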

Start the full stack:

Clone the repository to get the compose files

git clone https://github.com/hwdsl2/docker-ai-stack
cd docker-ai-stack
docker compose up -d

Pull a model (required before making LLM requests):

docker exec ollama ollama_manage --pull llama3.2:3b

Check the logs to confirm all services are ready:

docker compose logs

Run the health check to verify all services are working:

./stack-check.sh
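Services can take a while to come up on first start, so scripts that run right after `docker compose up -d` may want to wait before calling the health check. The sketch below is a generic retry helper and does not reproduce what stack-check.sh does internally; `wait_until` and `port_answers` are hypothetical names, and the probe only confirms that something answers HTTP on a published port.

```shell
# Retry a command until it succeeds: wait_until <tries> <delay-seconds> <command...>
wait_until() {
  local tries="$1" delay="$2"
  shift 2
  while [ "$tries" -gt 0 ]; do
    if "$@"; then
      return 0
    fi
    tries=$((tries - 1))
    if [ "$tries" -gt 0 ]; then
      sleep "$delay"
    fi
  done
  return 1
}

# Succeed once something answers HTTP on the given local port (any status code counts).
port_answers() {
  curl -s -o /dev/null --max-time 2 "http://localhost:$1/"
}
```

For example, `wait_until 30 2 port_answers 4000 && ./stack-check.sh` waits up to about a minute for the LiteLLM port before running the repository's own check.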

Get the API keys:

Ollama API key

docker exec ollama ollama_manage --showkey

LiteLLM API key

docker exec litellm litellm_manage --showkey

MCP Gateway API key

docker exec mcp mcp_manage --showkey
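For scripting, it can help to capture the keys in environment variables. This sketch reuses the filter that appears in the examples later in this README (first output line starting with `sk-` for LiteLLM and `mcp-` for MCP Gateway). The helper names are illustrative, and the Ollama key is left out because its prefix is not stated here.

```shell
# Read command output on stdin and print the first line that starts with the given prefix.
first_line_with_prefix() {
  grep -- "^$1" | head -n 1
}

# Export the LiteLLM and MCP Gateway keys (requires the containers to be running).
load_stack_keys() {
  LITELLM_KEY=$(docker exec litellm litellm_manage --showkey | first_line_with_prefix 'sk-')
  MCP_KEY=$(docker exec mcp mcp_manage --showkey | first_line_with_prefix 'mcp-')
  export LITELLM_KEY MCP_KEY
}
```

After sourcing this file, calling `load_stack_keys` makes `$LITELLM_KEY` and `$MCP_KEY` available to later curl commands.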

Stop the stack:

docker compose down

GPU acceleration (NVIDIA CUDA)

For NVIDIA GPU acceleration, use the CUDA compose file:

docker compose -f docker-compose.cuda.yml up -d

Requirements: NVIDIA GPU, NVIDIA driver 535+, and the NVIDIA Container Toolkit installed on the host. CUDA images are linux/amd64 only.
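The driver requirement can be verified on the host before switching compose files. A minimal sketch, assuming `nvidia-smi` is installed alongside the driver; the function names are illustrative, and 535 is the minimum stated above.

```shell
# Succeed when the driver's major version is at least the required one (default 535).
driver_at_least() {
  local version="$1" required="${2:-535}"
  local major="${version%%.*}"
  [ "$major" -ge "$required" ] 2>/dev/null
}

# Print the installed NVIDIA driver version, e.g. 550.54.14.
installed_driver_version() {
  nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1
}
```

A guard such as `driver_at_least "$(installed_driver_version)" && docker compose -f docker-compose.cuda.yml up -d` falls through without starting anything when the driver is too old or missing. It does not check for the NVIDIA Container Toolkit.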

Lightweight stacks

Don't need the full stack? Use a pre-configured subset from the stacks/ folder:

| Stack | Services | Memory | Use case |
|---|---|---|---|
| chat-ui | Ollama + LiteLLM + AnythingLLM | ~3 GB | Web-based ChatGPT-like chat interface |
| voice-pipeline | Whisper + Ollama + LiteLLM + Kokoro | ~5 GB | Speech-to-text → LLM → text-to-speech |
| rag-pipeline | Ollama + LiteLLM + Embeddings | ~3 GB | Semantic search + LLM Q&A |
| rag-pipeline-full | Ollama + LiteLLM + Embeddings + Docling | ~4 GB | Document parsing + semantic search + LLM Q&A |
| ai-tools | Ollama + LiteLLM + MCP Gateway | ~3 GB | AI coding assistant with tool access |
| chat-only | Ollama + LiteLLM | ~2.5 GB | Minimal local ChatGPT replacement |

git clone https://github.com/hwdsl2/docker-ai-stack
cd docker-ai-stack/stacks/chat-ui  # or voice-pipeline, rag-pipeline, rag-pipeline-full, ai-tools, chat-only
docker compose up -d

Running without Docker Compose

If you prefer using docker run commands directly, first create a shared network so services can communicate:

docker network create ai-stack

Then start each service on the shared network:

Ollama (LLM)

docker run -d --name ollama --restart always \
  --network ai-stack \
  -v ollama-data:/var/lib/ollama \
  hwdsl2/ollama-server

LiteLLM (AI gateway)

docker run -d --name litellm --restart always \
  --network ai-stack \
  -p 4000:4000 \
  -e LITELLM_OLLAMA_BASE_URL=http://ollama:11434 \
  -v litellm-data:/etc/litellm \
  hwdsl2/litellm-server

Embeddings

docker run -d --name embeddings --restart always \
  --network ai-stack \
  -p 127.0.0.1:8000:8000 \
  -v embeddings-data:/var/lib/embeddings \
  hwdsl2/embeddings-server

Whisper (STT)

docker run -d --name whisper --restart always \
  --network ai-stack \
  -p 127.0.0.1:9000:9000 \
  -v whisper-data:/var/lib/whisper \
  hwdsl2/whisper-server

Kokoro (TTS)

docker run -d --name kokoro --restart always \
  --network ai-stack \
  -p 127.0.0.1:8880:8880 \
  -v kokoro-data:/var/lib/kokoro \
  hwdsl2/kokoro-server

Docling (document parsing)

docker run -d --name docling --restart always \
  --network ai-stack \
  -p 127.0.0.1:5001:5001 \
  -v docling-data:/var/lib/docling \
  hwdsl2/docling-server

MCP Gateway

docker run -d --name mcp --restart always \
  --network ai-stack \
  -v mcp-data:/var/lib/mcp \
  hwdsl2/mcp-gateway

Note: The shared network allows services to reach each other by container name (e.g., LiteLLM connects to Ollama via http://ollama:11434). You can start only the services you need — they don't all have to run together.

Pull a model (required before making LLM requests):

docker exec ollama ollama_manage --pull llama3.2:3b

Connect MCP Gateway to LiteLLM

LiteLLM and MCP Gateway are automatically wired when using the compose files in this repository — no manual key setup is needed.

API keys are shared automatically between services via Docker shared volumes:

Ollama generates an API key on first start and copies it to a shared volume

MCP Gateway does the same

LiteLLM reads both keys from the shared volumes on startup

The LITELLM_MCP_URL=http://mcp:3000/mcp and LITELLM_OLLAMA_BASE_URL=http://ollama:11434 environment variables are pre-configured in the compose files, so all services are connected automatically with a single docker compose up -d.

Once connected, AI clients that call LiteLLM can use MCP tools (filesystem, fetch, GitHub, etc.) directly through the LiteLLM proxy.

Voice pipeline example

Transcribe a spoken question, get a local LLM response via Ollama, and convert it to speech:

Tip: Need a sample audio file? Download this English speech sample (WAV, MIT License) from the Azure Samples repository:

curl -L -o sample_speech.wav \
  "https://github.com/Azure-Samples/cognitive-services-speech-sdk/raw/master/sampledata/audiofiles/katiesteve.wav"

LITELLM_KEY=$(docker exec litellm litellm_manage --showkey | grep '^sk-' | head -1)

Step 1: Transcribe audio to text (Whisper)

TEXT=$(curl -s http://localhost:9000/v1/audio/transcriptions \
  -F file=@sample_speech.wav -F model=whisper-1 | jq -r .text)

Step 2: Send text to Ollama via LiteLLM and get a response

RESPONSE=$(curl -s http://localhost:4000/v1/chat/completions \
  -H "Authorization: Bearer $LITELLM_KEY" \
  -H "Content-Type: application/json" \
  -d "{\"model\":\"ollama/llama3.2:3b\",\"messages\":[{\"role\":\"user\",\"content\":\"$TEXT\"}]}" \
  | jq -r '.choices[0].message.content')

Step 3: Convert the response to speech (Kokoro TTS)

curl -s http://localhost:8880/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d "{\"model\":\"tts-1\",\"input\":\"$RESPONSE\",\"voice\":\"af_heart\"}" \
  --output response.mp3
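Steps 2 and 3 splice `$TEXT` and `$RESPONSE` directly into a JSON string, which produces invalid JSON as soon as the transcript or the model's reply contains a double quote or a newline. One way around this, sketched here with hypothetical helper names, is to let jq (already used above) build and escape the request bodies. The model and voice defaults are the ones from the example.

```shell
# Build a chat-completions request body; jq escapes quotes and newlines in the text.
chat_payload() {
  jq -cn --arg model "${2:-ollama/llama3.2:3b}" --arg text "$1" \
    '{model: $model, messages: [{role: "user", content: $text}]}'
}

# Build a speech request body for Kokoro in the same way.
speech_payload() {
  jq -cn --arg input "$1" --arg voice "${2:-af_heart}" \
    '{model: "tts-1", input: $input, voice: $voice}'
}
```

In the pipeline, the `-d "{...}"` arguments would then become `-d "$(chat_payload "$TEXT")"` and `-d "$(speech_payload "$RESPONSE")"`, with the rest of each curl command unchanged.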

RAG pipeline example

Embed documents for semantic search, retrieve context, then answer questions with a local Ollama model:

LITELLM_KEY=$(docker exec litellm litellm_manage --showkey | grep '^sk-' | head -1)

Step 1: Embed a document chunk and store the vector in your vector DB

curl -s http://localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"input": "Docker simplifies deployment by packaging apps in containers.", "model": "text-embedding-ada-002"}' \
  | jq '.data[0].embedding'

→ Store the returned vector alongside the source text in Qdrant, Chroma, pgvector, etc.

Step 2: At query time, embed the question, retrieve the top matching chunks from the vector DB, then send the question and retrieved context to Ollama via LiteLLM.

curl -s http://localhost:4000/v1/chat/completions \
  -H "Authorization: Bearer $LITELLM_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ollama/llama3.2:3b",
    "messages": [
      {"role": "system", "content": "Answer using only the provided context."},
      {"role": "user", "content": "What does Docker do?\n\nContext: Docker simplifies deployment by packaging apps in containers."}
    ]
  }' \
  | jq -r '.choices[0].message.content'
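The retrieval step in between is normally handled by the vector DB, which ranks stored chunks by their similarity to the question's embedding. To illustrate what that ranking computes, here is a small sketch of cosine similarity over two vectors in the JSON array form that `jq -c '.data[0].embedding'` prints. The function name is illustrative, and this is meant for experimenting with a handful of chunks, not as a substitute for a vector DB.

```shell
# Print the cosine similarity of two vectors given as JSON arrays (or comma-separated lists).
# Exits non-zero when the lengths differ or either vector is all zeros.
cosine_similarity() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    gsub(/\[|\]| /, "", a); gsub(/\[|\]| /, "", b)   # strip brackets and spaces
    n = split(a, x, ","); m = split(b, y, ",")
    if (n != m || n == 0) exit 1
    dot = 0; na = 0; nb = 0
    for (i = 1; i <= n; i++) { dot += x[i] * y[i]; na += x[i] * x[i]; nb += y[i] * y[i] }
    if (na == 0 || nb == 0) exit 1
    printf "%.4f\n", dot / (sqrt(na) * sqrt(nb))
  }'
}
```

Given a question embedding `$Q` and chunk embeddings `$C1`, `$C2`, the chunk with the highest `cosine_similarity "$Q" "$Cn"` score is the one to pass as context.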

MCP tools example

Use MCP Gateway to give your AI assistant access to files, web, and GitHub:

MCP_KEY=$(docker exec mcp mcp_manage --showkey | grep '^mcp-' | head -1)

Use MCP endpoint with an AI client (e.g., Cline in VS Code)

Set the MCP server URL: http://localhost:3000/mcp

Set the Authorization header: Bearer followed by the MCP Gateway API key (the value of $MCP_KEY above)

Or test the MCP endpoint directly with an initialize request

curl -s http://localhost:3000/mcp \
  -X POST \
  -H "Authorization: Bearer $MCP_KEY" \
  -H "Content-Type: application/json" \
  -H "Accept: application/json, text/event-stream" \
  -d '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2025-03-26","capabilities":{},"clientInfo":{"name":"test","version":"1.0"}}}'
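After a successful initialize, the usual next steps in the MCP protocol are an `notifications/initialized` notification followed by requests such as `tools/list`. The sketch below builds those JSON-RPC bodies and posts them with the same headers as above. It assumes the gateway follows the 2025-03-26 Streamable HTTP transport, where a server may return an `Mcp-Session-Id` response header that must be sent back on later requests; whether this gateway issues one is not stated here, so the session argument is optional. The helper names are hypothetical.

```shell
MCP_URL="${MCP_URL:-http://localhost:3000/mcp}"

# Build a JSON-RPC 2.0 request without params: jsonrpc_request <id> <method>
jsonrpc_request() {
  printf '{"jsonrpc":"2.0","id":%d,"method":"%s"}\n' "$1" "$2"
}

# Build a JSON-RPC 2.0 notification (no id, no response expected).
jsonrpc_notification() {
  printf '{"jsonrpc":"2.0","method":"%s"}\n' "$1"
}

# POST a body to the MCP endpoint: mcp_post <body> [session-id]
# The Mcp-Session-Id header is added only when a session id is passed.
mcp_post() {
  curl -s -X POST "$MCP_URL" \
    -H "Authorization: Bearer $MCP_KEY" \
    -H "Content-Type: application/json" \
    -H "Accept: application/json, text/event-stream" \
    ${2:+-H "Mcp-Session-Id: $2"} \
    -d "$1"
}
```

To look for a session id, add `-D -` to the initialize request and check the response headers. Then `mcp_post "$(jsonrpc_notification notifications/initialized)" "$SESSION_ID"` followed by `mcp_post "$(jsonrpc_request 2 tools/list)" "$SESSION_ID"` should list the tools the gateway exposes.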

Customization

Each service can be configured with an optional env file. Copy the example env file from the respective repository, edit it, and uncomment the volume mount in docker-compose.yml:

| Service | Env file | Repository |
|---|---|---|
| Ollama | ollama.env | docker-ollama |
| LiteLLM | litellm.env | docker-litellm |
| Embeddings | embed.env | docker-embeddings |
| Whisper | whisper.env | docker-whisper |
| WhisperLive | whisper-live.env | docker-whisper-live |
| Kokoro | kokoro.env | docker-kokoro |
| MCP Gateway | mcp.env | docker-mcp-gateway |
| Docling | docling.env | docker-docling |

For detailed configuration options, API reference, and model management, see the documentation in each service's repository.

Internet-facing deployments

By default, all services listen over plain HTTP. For internet-facing deployments, place a reverse proxy (e.g., Caddy, Nginx, or Traefik) in front of the stack to provide HTTPS. Each service repository includes a detailed reverse proxy guide with Caddy and nginx examples. The chat-ui stack also includes a reverse proxy section specific to AnythingLLM.

When exposing services to the internet, set API keys for services that are optional-auth by default (Whisper, WhisperLive, Kokoro, Embeddings, Docling) via their respective env files.
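As a rough illustration of the reverse proxy setup, the sketch below prints a minimal Caddyfile site block that forwards a public hostname to one local service port. The hostname `ai.example.com` and the function name are placeholders, and the service repositories' own reverse proxy guides remain the reference for production settings. Caddy obtains HTTPS certificates automatically for public hostnames, which is why the block needs no TLS directives.

```shell
# Print a Caddyfile site block: render_caddy_site <domain> <local-port>
render_caddy_site() {
  local domain="$1" port="$2"
  printf '%s {\n    reverse_proxy 127.0.0.1:%s\n}\n' "$domain" "$port"
}
```

For example, `render_caddy_site ai.example.com 4000 >> Caddyfile` adds an entry that fronts LiteLLM on its default port.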

Backup and restore

Your API keys, models, and configuration are stored in Docker volumes. Back up before upgrading or making changes:

Export API keys (while containers are running)

docker exec ollama ollama_manage --showkey
docker exec litellm litellm_manage --showkey
docker exec mcp mcp_manage --showkey

Back up all volumes (stop services first)

docker compose down
mkdir -p backups
for vol in ollama-data litellm-data embeddings-data whisper-data whisper-live-data k

[truncated for AI cost control]