Beyond Transcription: ASR Model Delivers Words, Emotion, and Intent in 200ms

Whissle's META-1 is a meta-aware ASR model that simultaneously outputs transcription and metadata (emotion, intent, age, gender, etc.) in a single forward pass at ~200ms latency. By integrating KenLM n-gram language models, it reduces word error rates by up to 3.6% absolute (10.8% relative) across four languages, while extracting metadata 9x faster than commercial alternatives like Deepgram, AssemblyAI, and Gemini 2.0 Flash.

Source: Hacker News AI | Author: ksingla025

Beyond Transcription: How a Meta-Aware ASR Model Delivers Words, Emotion, and Intent in 200ms

By Whissle Research Team

Apr 16 2026

Most speech recognition systems give you words. Just words — a flat stream of text with timestamps. If you want to know how something was said — the speaker's emotion, their intent, how fast they're talking, whether they're using filler words — you need a separate pipeline: send the transcript to an LLM, call a sentiment API, run a classifier. Each step adds latency, cost, and complexity.

Whissle takes a different approach. Our Meta-aware Voice Action Model (META-1) is trained on a vocabulary that includes both regular text tokens and metadata action tokens such as EMOTION_HAPPY, INTENT_QUESTION, AGE_30_45, GENDER_FEMALE, and SPEAKER_CHANGE. The CTC decoder outputs these inline with the transcript in a single forward pass. One model, one stream, one latency budget: transcription and understanding together at ~200ms.
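To make the idea of an interleaved stream concrete, here is a minimal sketch of how a client might separate transcript words from metadata action tokens. The exact wire format of META-1's output is not described in this post, so the token list, the tag pattern, and the `split_stream` helper are all assumptions for illustration.

```python
import re

# Hypothetical tag shapes, inferred from the token names mentioned in the post.
META_TAG = re.compile(r"^(?:EMOTION|INTENT|AGE|GENDER|ENTITY)_[A-Z0-9_]+$|^SPEAKER_CHANGE$")

def split_stream(tokens):
    """Separate an interleaved token stream into transcript text and metadata tags."""
    words, tags = [], []
    for tok in tokens:
        (tags if META_TAG.match(tok) else words).append(tok)
    return " ".join(words), tags

# Made-up decoder output: text tokens and metadata tokens in one stream.
decoded = ["what", "time", "is", "it", "INTENT_QUESTION", "EMOTION_HAPPY", "AGE_30_45"]
text, tags = split_stream(decoded)
print(text)  # what time is it
print(tags)  # ['INTENT_QUESTION', 'EMOTION_HAPPY', 'AGE_30_45']
```

Because both kinds of token arrive in the same stream, no second request or second model is needed to recover the metadata.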

But CTC-based models have a well-known weakness: they decode each audio frame independently, with zero knowledge of language. The result is mangled word boundaries, phonetic guesses where real words should be, and transcripts that look like someone typed with their elbows. This problem is compounded when the model's vocabulary includes ~10,000 metadata tokens alongside ~8,000 text tokens — the decoder must navigate a much larger output space.

The fix is a traditional n-gram language model — not a neural network. N-gram models (built with KenLM) are essentially lookup tables of word sequence probabilities. They run in sub-millisecond time, need no GPU, operate at CTC frame rate, and carry zero hallucination risk. Instead of picking the single most likely token at each frame (greedy decoding), beam search explores multiple hypotheses and scores them against the n-gram model to find which word sequences actually occur in a language.

We benchmarked this system across four languages — English, Spanish, German, and Hindi — with 1,300 real-world audio samples and five provider configurations. The language model reduced word error rates by up to 3.6% absolute (10.8% relative) on German and Spanish, while the model simultaneously streamed emotion, intent, and demographics at ~200ms — 9x faster than the next closest metadata solution.

This post covers three questions:

How does a meta-aware ASR model compare against commercial providers — Deepgram Nova-3, AssemblyAI, and Gemini 2.0 Flash — across four languages?

Does adding a KenLM n-gram language model to the CTC decoder measurably improve accuracy without sacrificing the model's metadata capabilities?

What's the real cost of getting metadata from each provider — in latency, accuracy, and architectural complexity?

What Changed Since Our Last Benchmark

Our previous benchmark tested English-only with Whissle on CPU. This update introduces five major changes:

Meta-aware model framing. This benchmark evaluates Whissle's META-1 architecture — a single CTC model that emits transcription tokens and metadata action tokens (emotion, intent, age, gender, speaker change) in one forward pass. Previous benchmarks focused only on transcription accuracy.

GPU acceleration. Whissle now runs on NVIDIA L4 GPUs via Cloud Run (us-east4), replacing the CPU-only ONNX runtime. Real-time throughput improved significantly.

N-gram language model integration. KenLM-based 3-gram models, trained on AM training data transcriptions, are fused into CTC beam search decoding. Critically, the LM operates only on text tokens — metadata action tokens are suppressed after log-softmax normalization to preserve proper probability distributions.

Multilingual benchmarking. English, Spanish, German, and Hindi — with language-matched LM models. Deepgram upgraded from Nova-2 to Nova-3 (their latest multilingual model).

Metadata extraction benchmarking. We compared the latency and capability of getting metadata (emotion, intent, sentiment, entities) from three approaches: Whissle's single-stream real-time metadata (~200ms), Gemini 2.0 Flash via LLM prompting (1.8–2.2s batch), and Deepgram's batch Audio Intelligence API (0.9–1.2s batch).

The result is a five-provider comparison: Whissle greedy (pure acoustic model with streaming metadata), Whissle + LM (beam search with KenLM), Deepgram Nova-3, AssemblyAI Universal Streaming, and Gemini 2.0 Flash (batch LLM transcription) — tested across four languages with 1,300 total samples.
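The metadata suppression mentioned in the n-gram integration item can be sketched as follows. This is an illustration under assumptions, not Whissle's implementation: it assumes "suppressed" means setting metadata columns to negative infinity before the frame is handed to the LM-aware beam search, and the toy vocabulary layout is made up.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one frame."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def text_only_log_probs(frame_logits, metadata_ids):
    """Normalize over the FULL vocabulary first, then hide metadata tokens."""
    log_probs = log_softmax(frame_logits)
    return [float("-inf") if i in metadata_ids else lp
            for i, lp in enumerate(log_probs)]

# Toy frame: ids 0-2 are text tokens, ids 3-4 are metadata tokens (made-up values).
frame = [2.0, 1.0, 0.5, 3.0, 0.1]
masked = text_only_log_probs(frame, metadata_ids={3, 4})
```

The order matters. Masking before the softmax would redistribute the metadata tokens' probability mass onto the text tokens and inflate their scores; masking afterwards leaves each text token's log-probability exactly as the acoustic model produced it.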

How We Tested

All providers were tested using real-time WebSocket streaming -- the exact protocol you'd use in production. Audio was streamed in 100ms chunks at 1x real-time speed, simulating a live microphone.
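The pacing described above can be sketched like this. It assumes the PCM int16, mono, 16kHz format listed under the fair comparison measures; `send` is a hypothetical callback standing in for whatever WebSocket client writes the bytes, and `sleep` is injectable so the pacing can be checked without waiting.

```python
import time

SAMPLE_RATE = 16_000   # Hz
BYTES_PER_SAMPLE = 2   # PCM int16, mono
CHUNK_MS = 100

def iter_chunks(pcm: bytes, chunk_ms: int = CHUNK_MS):
    """Yield consecutive chunk_ms-long slices of raw PCM audio."""
    chunk_bytes = SAMPLE_RATE * BYTES_PER_SAMPLE * chunk_ms // 1000  # 3200 bytes at 100ms
    for start in range(0, len(pcm), chunk_bytes):
        yield pcm[start:start + chunk_bytes]

def stream_realtime(pcm: bytes, send, sleep=time.sleep, chunk_ms: int = CHUNK_MS):
    """Send audio at 1x real time: one chunk, then wait one chunk duration."""
    for chunk in iter_chunks(pcm, chunk_ms):
        send(chunk)
        sleep(chunk_ms / 1000)
```

One second of audio becomes ten 3,200-byte messages spread over roughly one second, which is what a live microphone would produce.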

| Provider | Model / Config | Streaming Endpoint |
|---|---|---|
| Whissle (greedy) | GPU + CTC greedy decode | wss://api.whissle.ai/asr/stream |
| Whissle + LM | GPU + CTC beam search + KenLM | wss://api.whissle.ai/asr/stream |
| Deepgram | Nova-3 (multilingual) | wss://api.deepgram.com/v1/listen |
| AssemblyAI | Universal Streaming | wss://streaming.assemblyai.com/v3/ws |
| Gemini 2.0 Flash | LLM batch transcription | generativelanguage.googleapis.com (REST) |

Fair comparison measures:

All providers received identical audio (PCM int16, mono, 16kHz)

Both Whissle configurations (greedy and +LM) ran sequentially on each sample, then Deepgram, AssemblyAI, and Gemini ran concurrently

Text normalized before WER: lowercased, punctuation stripped, whitespace collapsed

WER computed using the standard jiwer library

Language parameter set for each provider (e.g. language=es for Spanish)
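The normalization and scoring steps above can be sketched in plain Python. The benchmark itself uses the jiwer library; the version below is a standard-library stand-in that follows the same definition, and the choice to strip every Unicode punctuation character (so that marks such as "¿" are also removed) is an assumption.

```python
import re
import unicodedata

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))
    return re.sub(r"\s+", " ", text).strip()

def word_errors(ref_words, hyp_words) -> int:
    """Word-level Levenshtein distance: substitutions + insertions + deletions."""
    prev = list(range(len(hyp_words) + 1))
    for i, r in enumerate(ref_words, 1):
        cur = [i]
        for j, h in enumerate(hyp_words, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]

def wer(reference: str, hypothesis: str) -> float:
    ref = normalize(reference).split()
    hyp = normalize(hypothesis).split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return word_errors(ref, hyp) / len(ref)

print(wer("He went to the store.", "he went too the store"))  # 0.2
```

One substituted word out of five reference words gives a WER of 0.2, and the casing and trailing period do not count as errors because both sides are normalized first.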

Note on Gemini: Gemini 2.0 Flash is a batch LLM, not a streaming ASR service. Audio is sent as a single request and the full transcript returned at once. Latency numbers reflect total round-trip time, not streaming first-segment latency. This is a fundamentally different architecture -- included to compare accuracy between traditional ASR and LLM-based transcription.

Key metrics:

| Metric | What It Measures |
|---|---|
| WER (mean) | Word Error Rate -- (Insertions + Deletions + Substitutions) / Reference Words, averaged per sample |
| WER (median) | Median per-sample WER -- robust to outliers |
| CER | Character Error Rate -- same formula at character level |
| Time to first segment | Milliseconds from audio stream start to first non-empty transcript |
| RTFX | Inverse real-time factor (audio duration / processing time) -- how much faster than real time the transcription completes |
| Failure rate | Percentage of samples that returned no usable transcript |
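As a rough illustration of how these metrics aggregate, the sketch below summarizes a handful of per-sample results. The record layout, the sample values, and two choices are assumptions rather than the benchmark's documented method: failed samples are excluded from the WER averages, and RTFX is computed over total audio and total processing time of the completed samples.

```python
from statistics import mean, median

def summarize(samples):
    """samples: dicts with 'wer' (None when no usable transcript came back),
    'audio_s' (audio duration) and 'processing_s' (time to finish transcribing)."""
    ok = [s for s in samples if s["wer"] is not None]
    return {
        "wer_mean": mean(s["wer"] for s in ok),
        "wer_median": median(s["wer"] for s in ok),
        "rtfx": sum(s["audio_s"] for s in ok) / sum(s["processing_s"] for s in ok),
        "failure_rate": 1 - len(ok) / len(samples),
    }

# Made-up results for four samples; the last one returned no transcript.
results = [
    {"wer": 0.0, "audio_s": 10.0, "processing_s": 2.0},
    {"wer": 0.5, "audio_s": 6.0, "processing_s": 1.0},
    {"wer": 0.1, "audio_s": 4.0, "processing_s": 1.0},
    {"wer": None, "audio_s": 4.0, "processing_s": 1.0},
]
summary = summarize(results)
```

In this toy data a single bad sample (WER 0.5) pulls the mean to about 0.2 while the median stays at 0.1, which is why both figures are reported.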

The Four Datasets

We chose four speech datasets -- one per language -- from standard academic benchmarks. The European languages use read-speech audiobook narration from LibriSpeech. Hindi uses Meta's conversational speech corpus, testing the system against dialect variation and code-switching.

Dataset 1: LibriSpeech test-clean (English)

The industry baseline. Clean, studio-quality audiobook narration in American English.

| Property | Value |
|---|---|
| Source | openslr/librispeech_asr (HuggingFace) |
| Samples | 100 (from test split) |
| Total audio | 670.6 seconds (~11.2 minutes) |
| Total reference words | 1,870 |
| Language | English (US) |
| KenLM model | ENGLISH.bin (trained on English Wikipedia) |

Dataset 2: Multilingual LibriSpeech -- Spanish

Read speech in Castilian and Latin American Spanish, from the Multilingual LibriSpeech corpus.

| Property | Value |
|---|---|
| Source | WhissleAI/multilingual-libri-test-spanish (HuggingFace) |
| Samples | 100 |
| Total audio | 1,488.4 seconds (~24.8 minutes) |
| Total reference words | 3,250 |
| Language | Spanish |
| KenLM model | EUROPEAN.bin (trained on Spanish, French, German, Portuguese, Italian, and 10 more European language Wikipedias) |

Dataset 3: Multilingual LibriSpeech -- German

Read speech in Standard German, from the same Multilingual LibriSpeech family.

| Property | Value |
|---|---|
| Source | WhissleAI/multilingual-libri-test-german (HuggingFace) |
| Samples | 100 |
| Total audio | 1,451.8 seconds (~24.2 minutes) |
| Total reference words | 2,680 |
| Language | German |
| KenLM model | EUROPEAN.bin (same model as Spanish -- covers all European group languages) |

Dataset 4: Meta STT Hindi (Hindi)

Conversational Hindi with heavy dialect variation (Bihar, UP, etc.), noise markers, and code-switching. A significantly harder benchmark than read speech.

| Property | Value |
|---|---|
| Source | WhissleAI/Meta_STT_HI_Set1 (HuggingFace) |
| Samples | 1,000 (from test split) |
| Total audio | ~7,500 seconds (~125 minutes) |
| Total reference words | 22,055 |
| Language | Hindi (with dialect variation) |
| KenLM model | INDO_ARYAN.bin (trained on Hindi, Marathi, Bengali, Gujarati, and Urdu Wikipedia) |
| Note | AssemblyAI was not tested -- their streaming API does not support Hindi |

The N-Gram Language Model: How It Works

Before diving into results, it's worth understanding how the language model integrates with a CTC decoder that outputs both text and metadata tokens. This is the key engineering challenge: Whissle's META-1 model has ~18,189 tokens in its vocabulary, of which ~9,919 are metadata action tokens (EMOTION_*, INTENT_*, AGE_*, GENDER_*, ENTITY_*, SPEAKER_CHANGE). The language model must enhance transcription accuracy without interfering with metadata prediction.

The Problem: CTC Decoders and the Metadata Vocabulary

A CTC (Connectionist Temporal Classification) model processes audio frame by frame and outputs a probability distribution over the entire token vocabulary at each time step — including both text tokens and metadata action tokens. A greedy decoder picks the most likely token at each frame, collapses consecutive duplicates and blanks, and produces an interleaved stream of text and metadata. It's fast — but it only considers acoustic evidence. It has no knowledge of what words are likely to follow other words.
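Greedy CTC decoding as described here fits in a few lines. This is a generic sketch of the standard procedure, not Whissle's decoder; the four-token vocabulary, the blank id of 0, and the one-hot frames are made up for illustration.

```python
def ctc_greedy_decode(frame_scores, id_to_token, blank_id=0):
    """Pick the best token per frame, collapse consecutive repeats, drop blanks."""
    best_ids = [max(range(len(f)), key=f.__getitem__) for f in frame_scores]
    out, prev = [], None
    for idx in best_ids:
        if idx != prev and idx != blank_id:
            out.append(id_to_token[idx])
        prev = idx
    return out

# Toy vocabulary mixing text tokens with one metadata action token.
vocab = {0: "<blank>", 1: "hi", 2: "there", 3: "EMOTION_HAPPY"}

def one_hot(i, n=4):
    return [1.0 if j == i else 0.0 for j in range(n)]

frames = [one_hot(i) for i in [1, 1, 0, 2, 2, 0, 3]]
tokens = ctc_greedy_decode(frames, vocab)
print(tokens)  # ['hi', 'there', 'EMOTION_HAPPY']
```

Each frame is decided on its own, so nothing in this loop can prefer one spelling over another when two tokens sound alike.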

This means the greedy decoder makes mistakes that a human reader would immediately catch: "he went too the store" instead of "he went to the store." The acoustic signals for "too" and "to" are nearly identical. A language model knows that "to the" is far more probable than "too the" and can correct this.
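The "to the" versus "too the" intuition can be shown with a toy lookup table. The log-probabilities below are invented, the table is a bigram rather than the 3-gram models used in the benchmark, and the flat `UNSEEN` penalty is a crude stand-in for real n-gram backoff.

```python
# Made-up bigram log10 probabilities, for illustration only.
BIGRAM_LOGP = {
    ("went", "to"): -0.5,
    ("went", "too"): -3.0,
    ("to", "the"): -0.4,
    ("too", "the"): -3.5,
}
UNSEEN = -5.0  # flat penalty for word pairs missing from the table

def lm_score(sentence: str) -> float:
    """Sum bigram log-probabilities over adjacent word pairs."""
    words = sentence.split()
    return sum(BIGRAM_LOGP.get(pair, UNSEEN) for pair in zip(words, words[1:]))

good = lm_score("he went to the store")
bad = lm_score("he went too the store")
```

With these numbers the "to the" hypothesis scores higher than the "too the" one, so a decoder that consults the table would prefer it even when the acoustic evidence is a tie.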

KenLM: N-Gram Language Models from AM Training Data

The n-gram language models are trained on the text transcriptions from our acoustic model's training data -- the same multilingual speech corpus used to train the Whissle ASR model itself. This is a critical design choice: the LM learns the word distribution and n-gram statistics of the same domain the acoustic model was trained on, ensuring tight alignment between what the model hears and what the LM expects.

  1. Training data extraction. The Whissle acoustic model is trained on a large-scale multilingual speech corpus covering 35+ languages across 6 tokenizer groups. For each group, we extract the reference transcriptions from the training manifests -- the ground-truth text labels paired with the audio. This yields a word-level text corpus of over 100 million words across all groups, with each group's corpus reflecting the actual vocabulary and sentence patterns the ASR model encounters.
  2. Text normalization. Transcription text is NFKC-normalized, lowercased, and stripped of non-word characters (preserving Unicode ranges per script -- Latin, Cyrillic, Devanagari, CJK, etc.). Metadata tags (AGE_*, GENDER_*, EMOTION_*, etc.) are removed. This produces a clean word-level corpus per group.
  3. N-gram training. We use KenLM's lmplz to train a word-level 3-gram model with pruning thresholds 0 0 1 (keep all unigrams and bigrams, prune trigrams seen fewer than 2 times). The ARPA model is converted to compact binary using build_binary. A unigram word list (top 500K words) is extracted for beam search vocabulary constraint.
  4. Per-group deployment. The final artifacts are ENGLISH.bin, EUROPEAN.bin, INDO_ARYAN.bin, etc. -- one per tokenizer group. At server startup, these are loaded into CTCBeamSearchDecoder instances. When a request specifies language=es, the engine resolves Spanish to the EUROPEAN group and uses the corresponding decoder.
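The normalization and training steps above might look roughly like this. It is a sketch under assumptions: the tag pattern, the rule of keeping letters, combining marks, and digits in any script, and the file paths are all illustrative, and the commands are only built as argument lists, not executed. `lmplz` and `build_binary` are the KenLM command-line tools named in the steps.

```python
import re
import unicodedata

# Hypothetical tag pattern based on the tag families named in the post.
META_TAG = re.compile(
    r"\b(?:AGE|GENDER|EMOTION|INTENT|ENTITY)_[A-Z0-9_]+\b|\bSPEAKER_CHANGE\b"
)

def normalize_transcript(line: str) -> str:
    """Remove metadata tags, NFKC-normalize, lowercase, strip non-word characters."""
    line = META_TAG.sub(" ", line)  # tags are uppercase, so remove them before lowercasing
    line = unicodedata.normalize("NFKC", line).lower()
    # Keep letters (L), combining marks (M) and digits (N); everything else becomes a space.
    kept = "".join(ch if unicodedata.category(ch)[0] in "LMN" else " " for ch in line)
    return " ".join(kept.split())

def kenlm_commands(corpus_txt: str, arpa_path: str, bin_path: str):
    """Argument lists for a pruned 3-gram model and its binary conversion."""
    return [
        ["lmplz", "-o", "3", "--prune", "0", "0", "1",
         "--text", corpus_txt, "--arpa", arpa_path],
        ["build_binary", arpa_path, bin_path],
    ]

print(normalize_transcript("AGE_30_45 GENDER_FEMALE Hello, World! EMOTION_HAPPY"))  # hello world
```

Keeping combining marks matters for scripts such as Devanagari, where vowel signs are separate code points; a filter that keeps only alphanumeric characters would silently break those words apart.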

CTC Beam Search with Shallow Fusion

The beam search decoder combines two independent scores for each candidate word sequence via shallow fusion:

score(w) = log Pacoustic(w) + α · log

[truncated for AI cost control]