2026-06-13 · Site rewrite · 3 min read · Updated: 2026-06-13

Whissle Gateway – Run Multi-Modal Voice AI Locally in a 500MB Docker

Whissle Gateway is a lightweight Docker container that runs multi-modal voice AI locally with a single command, including ASR, TTS, voice calling, diarization, metadata analysis, and AI coaching. Models download automatically with no cloud dependency, supporting a wide range of hardware from CPUs to high-end GPUs.

Source: Hacker News AI · Author: ksingla025

Run VoiceAI locally

ASR, TTS, voice calling, diarization, metadata, and AI coaching in one Docker command. Models download automatically. No cloud dependency.


Quick start

$ docker run -d --name whissle \
    -p 9000:9000 -p 8001:8001 -p 8003:8003 \
    -v whissle-models:/models -v whissle-data:/data \
    -e VARIANT=en-full \
    -e ANTHROPIC_API_KEY=your-key \
    whissleasr/whissle-gateway:latest

Two settings adjust the command: VARIANT= and DEVICE=. With VARIANT=en-full, about 2 GB downloads on first run and is cached afterwards.

What happens when you run it:

═══════════════════════════════════════════════
  Whissle Gateway — en-full
═══════════════════════════════════════════════
No GPU detected → using CPU

Shared models:
  ✓ speaker encoder + VAD      26 MB
  ✓ punctuation                254 MB
  ✓ ITN (English + Hinglish)   1.5 MB

Variant: en-full
  ✓ en-in-tech-misc (485 MB)
  ✓ KenLM ENGLISH (1.5 GB)

Auth:
  Mode:   local
  Token:  wh_a1b2c3d4e5f6... (admin)
  Manage: curl -H 'Authorization: Bearer ...' localhost:9000/auth/tokens

Starting services...
  PostgreSQL: :5432 ●
  ASR:        :8001 ●
  TTS:        :8003 ●
  Agent:      :8765 ●
  Pipecat:    :8000 ●
  Gateway:    :9000 ●
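The startup log lists one port per service, so a quick reachability probe is an easy way to confirm the container came up. The following is a minimal sketch using only the Python standard library; the helper names (`port_open`, `service_status`, `tokens_request`) are illustrative and not part of Whissle. It also builds, without sending, the `/auth/tokens` request from the log's "Manage" line.

```python
import socket
import urllib.request

# Ports copied from the startup log; the names are only labels.
SERVICES = {
    "PostgreSQL": 5432,
    "ASR": 8001,
    "TTS": 8003,
    "Agent": 8765,
    "Pipecat": 8000,
    "Gateway": 9000,
}


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def service_status(host: str = "localhost", services: dict = SERVICES) -> dict:
    """Map each service name to True (reachable) or False."""
    return {name: port_open(host, port) for name, port in services.items()}


def tokens_request(token: str, base: str = "http://localhost:9000") -> urllib.request.Request:
    """Build (but do not send) the GET /auth/tokens request from the log."""
    return urllib.request.Request(
        f"{base}/auth/tokens",
        headers={"Authorization": f"Bearer {token}"},
    )


if __name__ == "__main__":
    for name, up in service_status().items():
        print(f"{name:<11} {'up' if up else 'down'}")
```

The quick start command publishes only ports 9000, 8001, and 8003, so the other services will report as down from the host unless their ports are also mapped with `-p`.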

API

Five interfaces — batch REST, streaming WebSocket, text-to-speech, voice calling, and an intelligent agent.

POST localhost:8001/transcribe

$ curl -X POST http://localhost:8001/transcribe \
    -F "file=@<audio-file>" \
    -F "diarize=true" \
    -F "num_speakers=2" \
    -F "punctuation=true" \
    -F "metadata_prob=true" \
    -F "summarize=sales_coaching" \
    -o result.json

Response — transcript + metadata per segment + AI analysis

{
  "segments": [
    {
      "speaker": "SPEAKER_00",
      "text": "Hello, good morning.",
      "start": 1.0,
      "end": 1.9,
      "metadata": {
        "emotion": "EMOTION_NEUTRAL",
        "behavior": "BEHAVIOR_DIRECT",
        "role": "ROLE_INTERVIEWER",
        "age": "AGE_30_45",
        "gender": "GENDER_MALE"
      },
      "words": [{"word": "Hello", "start": 1.0, "end": 1.3}]
    }
  ],
  "analysis": {
    "overall_score": 78,
    "buyer_outcome": "Converted",
    "practices": { "followed": 6, "total": 8 },
    "highlights": [...]
  }
}
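Once `result.json` is on disk, the per-segment structure is easy to aggregate. The sketch below assumes the response has exactly the shape shown above (`segments` with `speaker`, `start`, `end`, and `metadata.emotion`); the function names are illustrative, and the elided `highlights` list is replaced with an empty list so the sample is valid.

```python
from collections import Counter, defaultdict

# Sample copied from the response above; "highlights" is elided there,
# so an empty list stands in for it here.
SAMPLE_RESPONSE = {
    "segments": [
        {
            "speaker": "SPEAKER_00",
            "text": "Hello, good morning.",
            "start": 1.0,
            "end": 1.9,
            "metadata": {
                "emotion": "EMOTION_NEUTRAL",
                "behavior": "BEHAVIOR_DIRECT",
                "role": "ROLE_INTERVIEWER",
                "age": "AGE_30_45",
                "gender": "GENDER_MALE",
            },
            "words": [{"word": "Hello", "start": 1.0, "end": 1.3}],
        }
    ],
    "analysis": {
        "overall_score": 78,
        "buyer_outcome": "Converted",
        "practices": {"followed": 6, "total": 8},
        "highlights": [],
    },
}


def talk_time_by_speaker(response: dict) -> dict:
    """Sum (end - start) per speaker, in seconds, rounded to milliseconds."""
    totals = defaultdict(float)
    for seg in response.get("segments", []):
        totals[seg["speaker"]] += seg["end"] - seg["start"]
    return {speaker: round(total, 3) for speaker, total in totals.items()}


def emotions_by_speaker(response: dict) -> dict:
    """Count emotion labels per speaker."""
    counts = defaultdict(Counter)
    for seg in response.get("segments", []):
        counts[seg["speaker"]][seg["metadata"]["emotion"]] += 1
    return {speaker: dict(c) for speaker, c in counts.items()}


if __name__ == "__main__":
    print(talk_time_by_speaker(SAMPLE_RESPONSE))
    print(emotions_by_speaker(SAMPLE_RESPONSE))
```

For a real run, load the file with `json.load(open("result.json"))` and pass the resulting dict to the same functions.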

Parameters

All parameters for POST /transcribe.

| Parameter | Type | Default | Description |
|---|---|---|---|
| file | file | required | Audio file (MP3, WAV, FLAC, OGG, M4A) |
| language | string | auto | Language hint: en, hi, zh |
| diarize | bool | false | Speaker diarization |
| num_speakers | int | auto | Exact speaker count (if known) |
| punctuation | bool | true | Restore punctuation and capitalization |
| itn | bool | true | Inverse text normalization (numbers, currency) |
| use_lm | bool | true | KenLM language model beam search |
| metadata_prob | bool | false | Probability distributions for metadata |
| word_timestamps | bool | false | Per-word start/end timestamps |
| speech_analysis | bool | false | Speech patterns (pace, fillers, fluency) |
| summarize | string | (none) | AI analysis: true, sales_coaching, collections, or custom prompt |
| hotwords | string | (none) | Comma-separated hotwords for boosting |
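The parameter table maps naturally onto a small validation layer in front of the request. This is a sketch, not an official client: the helper names are made up for illustration, and it assumes a false boolean is spelled `false` on the wire, mirroring the `true` seen in the curl example.

```python
from typing import Dict, List

URL = "http://localhost:8001/transcribe"

# Parameter names and types copied from the table above.
BOOL_PARAMS = {
    "diarize", "punctuation", "itn", "use_lm",
    "metadata_prob", "word_timestamps", "speech_analysis",
}
INT_PARAMS = {"num_speakers"}
STRING_PARAMS = {"language", "summarize", "hotwords"}


def build_form_fields(**options) -> Dict[str, str]:
    """Validate options against the table and encode them as form strings."""
    fields: Dict[str, str] = {}
    for name, value in options.items():
        if name in BOOL_PARAMS:
            if not isinstance(value, bool):
                raise TypeError(f"{name} expects a bool")
            # Assumption: false is sent as "false", like "true" in the curl call.
            fields[name] = "true" if value else "false"
        elif name in INT_PARAMS:
            if isinstance(value, bool) or not isinstance(value, int) or value < 1:
                raise ValueError(f"{name} expects a positive integer")
            fields[name] = str(value)
        elif name in STRING_PARAMS:
            if isinstance(value, (list, tuple)):
                value = ",".join(value)  # hotwords are comma-separated
            fields[name] = str(value)
        else:
            raise KeyError(f"unknown parameter: {name}")
    return fields


def curl_args(audio_path: str, fields: Dict[str, str], url: str = URL) -> List[str]:
    """Return the equivalent curl invocation as an argument list."""
    args = ["curl", "-X", "POST", url, "-F", f"file=@{audio_path}"]
    for name, value in fields.items():
        args += ["-F", f"{name}={value}"]
    return args


if __name__ == "__main__":
    fields = build_form_fields(diarize=True, num_speakers=2, summarize="sales_coaching")
    print(" ".join(curl_args("call.wav", fields)))
```

Rejecting unknown names early catches typos such as `diarise` before an audio file is uploaded; `call.wav` is a placeholder file name.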

AI analysis modes

Add -F "summarize=mode" to any transcription. The diarized transcript + metadata is sent to Claude or Gemini for analysis.

| summarize value | Mode | Output |
|---|---|---|
| sales_coaching | Sales Coaching | 8 best practices scored. Rep/buyer identification. Highlights with timestamps. Behavior labels per segment. Overall score 0–100. |
| collections | Collections Compliance | Identity verification, reason stated, amount mentioned, no harassment. Call outcome (Promise to Pay / Dispute / Hardship). Next action. |
| true | General Summary | Overview, participants, key topics, emotional dynamics, entities, outcome. Markdown format. |
| your prompt here | Custom Prompt | Pass any prompt string. The LLM receives your instructions + full transcript with per-segment metadata. |
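To make the mode selection concrete, here is a small sketch that turns a mode into the `summarize` field value and condenses a sales-coaching result into one line. It assumes the `analysis` object has the fields shown in the sample response (`overall_score`, `buyer_outcome`, `practices`); other modes may return different shapes, and all function names are illustrative.

```python
NAMED_MODES = {"sales_coaching", "collections"}

# Copied from the "analysis" object in the sample response.
SAMPLE_ANALYSIS = {
    "overall_score": 78,
    "buyer_outcome": "Converted",
    "practices": {"followed": 6, "total": 8},
}


def summarize_value(mode) -> str:
    """Turn True, a named mode, or a custom prompt into the form field value."""
    if mode is True:
        return "true"
    if isinstance(mode, str) and mode.strip():
        return mode.strip()
    raise ValueError("mode must be True or a non-empty string")


def mode_kind(value: str) -> str:
    """Classify a summarize value as general, a named mode, or custom."""
    if value == "true":
        return "general"
    if value in NAMED_MODES:
        return value
    return "custom"


def coaching_summary(analysis: dict) -> str:
    """One-line digest of a sales_coaching analysis."""
    practices = analysis["practices"]
    followed, total = practices["followed"], practices["total"]
    ratio = followed / total if total else 0.0
    return (
        f"score {analysis['overall_score']}/100, "
        f"outcome {analysis['buyer_outcome']}, "
        f"practices {followed}/{total} ({ratio:.0%})"
    )


if __name__ == "__main__":
    print(mode_kind(summarize_value("sales_coaching")))
    print(coaching_summary(SAMPLE_ANALYSIS))
```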

Models

Each model extracts different metadata in a single ASR forward pass — no separate models or API calls.

en-in-tech-misc

485 MB

BEHAVIOR · EMOTION · EVAL · ROLE · AGE · GENDER · ENTITY

120M params. 26 behavioral codes for coaching, therapy, and interviews. 8 evaluation labels.

English · 6 heads, 51 classes

hinglish-loans

479 MB

INTENT · EMOTION · ROLE · AGE · GENDER · ENTITY

115M params. Debt collection intents: pay-back, disputes, hardship. Agent/Customer role detection.

Hindi-English · 5 heads, 26 classes

zh

627 MB

DIALECT · AGE · GENDER · ENTITY

160M params. Mandarin with North/South dialect detection.

Mandarin · 3 heads, 12 classes

whissle-large

2.4 GB

INTENT · EMOTION · AGE · GENDER · ENTITY

600M params with inline action tokens. 31 intent groups, 18K vocabulary.

23 languages · 5,500+ action tokens

Kokoro TTS

82 MB

55 voices

Non-autoregressive text-to-speech. Sub-200ms TTFB on CPU. Always included.

10 languages · Baked in

Punctuation + ITN

255 MB

Capitalization · Numbers

Punctuation restoration and inverse text normalization.

EN + Hinglish · Auto-downloaded

Metadata per segment

Every segment includes these tags. Common tags appear on all models. Additional tags depend on the model.

| Tag | Values | Models |
|---|---|---|
| emotion | EMOTION_NEUTRAL, EMOTION_HAPPY, EMOTION_SAD, EMOTION_ANGRY, EMOTION_FEAR, EMOTION_SURPRISE | All |
| age | AGE_0_18, AGE_18_30, AGE_30_45, AGE_45_60, AGE_60+ | All |
| gender | GENDER_MALE, GENDER_FEMALE | All |
| behavior | 26 types (BEHAVIOR_EXPLAIN, BEHAVIOR_QUESTION, BEHAVIOR_ACKNOWLEDGE, ...) | en-in-tech-misc |
| eval | EVAL_CORRECT, EVAL_PROBE, EVAL_PARTIAL, EVAL_INCORRECT, EVAL_HINT, EVAL_SKIP | en-in-tech-misc |
| role | ROLE_INTERVIEWER / ROLE_INTERVIEWEE or ROLE_AGENT / ROLE_CUSTOMER | en-in-tech-misc, hinglish-loans |
| intent | 13 collections intents or 31 general intents (INTENT_GREETING, INTENT_QUESTION, ...) | hinglish-loans, whissle-large |
| dialect | DIALECT_NORTH, DIALECT_SOUTH, DIALECT_OTHERS | zh |
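The tag values follow a `CATEGORY_VALUE` naming pattern, which makes them straightforward to split for downstream filtering. The helpers below are an illustrative sketch based only on the labels listed in the table; they assume that every tag uses this prefix convention and that age brackets are either `AGE_low_high` or open-ended like `AGE_60+`.

```python
from typing import Optional, Tuple


def split_tag(tag: str) -> Tuple[str, str]:
    """Split 'EMOTION_NEUTRAL' into ('emotion', 'NEUTRAL')."""
    category, _, value = tag.partition("_")
    if not category or not value:
        raise ValueError(f"not a CATEGORY_VALUE tag: {tag!r}")
    return category.lower(), value


def age_range(tag: str) -> Tuple[int, Optional[int]]:
    """Parse an age tag into (low, high); high is None for open-ended brackets."""
    category, value = split_tag(tag)
    if category != "age":
        raise ValueError(f"not an age tag: {tag!r}")
    if value.endswith("+"):
        return int(value[:-1]), None
    low, high = value.split("_")
    return int(low), int(high)


if __name__ == "__main__":
    for tag in ("EMOTION_NEUTRAL", "DIALECT_NORTH", "AGE_30_45", "AGE_60+"):
        print(tag, "->", split_tag(tag))
    print(age_range("AGE_30_45"), age_range("AGE_60+"))
```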

Variants

Choose your variant based on language and quality needs. Switch by changing VARIANT= and restarting. Cached models are reused.

| Variant | Languages | Download | Best for |
|---|---|---|---|
| hinglish | Hindi-English | ~515 MB | Debt collections, Hindi-English call centers |
| en-lite | English | ~500 MB | Quick testing, development |
| en-full ★ | English | ~2 GB | Sales coaching, interviews, therapy |
| multi-full | 23 languages | ~4 GB | Multilingual, highest quality |
| multi-zh | 23 langs + Mandarin | ~5 GB | Multilingual + dialect detection |
| all | All | ~6 GB | Maximum flexibility |

Runs everywhere

From your laptop (CPU) to data center GPUs. Same Docker, same API. Auto-detects GPU.

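| Hardware | VRAM | Variant | Concurrent |
|---|---|---|---|
| MacBook / Laptop | CPU | Any | 1–3 |
| Mac Mini M4 Pro | 24 GB unified | en-full | 3–8 |
| NVIDIA T4 | 16 GB | en-lite | 5–10 |
| RTX 4090 | 24 GB | en-full | 20–50 |
| A100 40GB | 40 GB | multi-full | 50–80 |
| RTX 6000 Ada | 48 GB | all | 50–100 |
| H100 | 80 GB | all | 150–300 |
| DGX Spark | 128 GB unified | all | 30–60 |
| H200 | 141 GB | all | 250–500 |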
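The concurrency ranges can serve as a rough starting point for capacity planning. The sketch below is back-of-the-envelope arithmetic only: it assumes the "Concurrent" column means simultaneous streams per device and that capacity scales linearly across devices, neither of which the table states. The function name is illustrative.

```python
import math
from typing import Tuple

# (low, high) concurrent streams per device, copied from the hardware table.
CONCURRENCY = {
    "MacBook / Laptop": (1, 3),
    "Mac Mini M4 Pro": (3, 8),
    "NVIDIA T4": (5, 10),
    "RTX 4090": (20, 50),
    "A100 40GB": (50, 80),
    "RTX 6000 Ada": (50, 100),
    "H100": (150, 300),
    "DGX Spark": (30, 60),
    "H200": (250, 500),
}


def devices_needed(hardware: str, target_streams: int) -> Tuple[int, int]:
    """Return (optimistic, conservative) device counts for a target load.

    Optimistic divides by the top of the range, conservative by the bottom.
    Assumes linear scaling across devices, which is not stated in the table.
    """
    if target_streams < 1:
        raise ValueError("target_streams must be at least 1")
    low, high = CONCURRENCY[hardware]
    return math.ceil(target_streams / high), math.ceil(target_streams / low)


if __name__ == "__main__":
    best, worst = devices_needed("RTX 4090", 120)
    print(f"120 streams on RTX 4090: between {best} and {worst} devices")
```

For example, 120 streams on RTX 4090 hardware works out to between 3 devices (120 / 50, rounded up) and 6 devices (120 / 20); real numbers should come from load testing with representative audio.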

| Docker Tag | Arch | Runtime |
|---|---|---|
| whissleasr/whissle-gateway:latest | amd64 | CPU: Mac (Rosetta), Linux, Windows |
| whissleasr/whissle-gateway:gpu | amd64 | NVIDIA CUDA 12.4 + onnxruntime-gpu |
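Choosing between the two image tags and the six variants can be scripted. The following sketch rebuilds the quick start command as an argument list; the function name is illustrative, and passing `--gpus all` for the `:gpu` image is an assumption based on standard Docker usage with NVIDIA GPUs, not something this page confirms.

```python
import shlex
from typing import List, Optional

IMAGE = "whissleasr/whissle-gateway"
# Variant names copied from the variants table.
VARIANTS = {"hinglish", "en-lite", "en-full", "multi-full", "multi-zh", "all"}


def docker_run_args(
    variant: str = "en-full",
    gpu: bool = False,
    anthropic_key: Optional[str] = None,
) -> List[str]:
    """Build the quick start `docker run` command for a variant and runtime."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant: {variant}")
    args = ["docker", "run", "-d", "--name", "whissle"]
    if gpu:
        # Assumption: the :gpu image is started with Docker's --gpus flag.
        args += ["--gpus", "all"]
    for port in (9000, 8001, 8003):
        args += ["-p", f"{port}:{port}"]
    args += ["-v", "whissle-models:/models", "-v", "whissle-data:/data"]
    args += ["-e", f"VARIANT={variant}"]
    if anthropic_key:
        args += ["-e", f"ANTHROPIC_API_KEY={anthropic_key}"]
    args.append(f"{IMAGE}:{'gpu' if gpu else 'latest'}")
    return args


if __name__ == "__main__":
    print(shlex.join(docker_run_args("en-lite")))
    print(shlex.join(docker_run_args("all", gpu=True)))
```

Returning a list rather than a string lets the result go straight to `subprocess.run` without shell quoting concerns.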

Architecture

┌──────────────────────────────────────────────────────────────┐
│                       Docker Container                       │
│                                                              │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────────┐  │
│  │   ASR    │  │   TTS    │  │ Pipecat  │  │    Agent     │  │
│  │  :8001   │  │  :8003   │  │  :8000   │  │    :8765     │  │
│  │          │  │  Kokoro  │  │          │  │  Claude /    │  │
│  │  ONNX    │  │  82M     │  │  WebRTC  │  │  Gemini API  │  │
│  │  +KenLM  │  │ 55 voice │  │  Twilio  │  │              │  │
│  │  +ECAPA  │  │          │  │ Voice AI │  │  Summarize   │  │
│  │  +VAD    │  │          │  │          │  │  Coach       │  │
│  │  +Punct  │  │          │  │  Auth    │  │  Analyze     │  │
│  │  +ITN    │  │          │  │ Multi-org│  │              │  │
│  └──────────┘  └──────────┘  └──────────┘  └──────────────┘  │
│                              │                               │
│                       ┌──────────────┐                       │
│                       │  PostgreSQL  │                       │
│                       │    :5432     │                       │
│                       └──────────────┘                       │
│                                                              │
│  /models  (Docker volume — cached ASR models)                │
│  /data    (Docker volume — PostgreSQL, auth, conversations)  │
└──────────────────────────────────────────────────────────────┘

whissle-models volume

ASR models, KenLM, punctuation, ITN. Downloaded on first run, cached forever. Survives container restarts.

whissle-data volume

Conversations, analytics, agent configs, auth tokens. Persists across restarts. Only deleted by docker volume rm.

Get started

One command. Models download automatically. Ready in 2 minutes. Built for contact centers, sales intelligence, behavioral AI, and more.
