Why standard WER fails for Indian languages
This article analyzes the limitations of standard WER/CER in evaluating Indian language ASR systems and proposes a layered LLM-based evaluation approach, including LLM-WER, LLM-CER, Intent Score, and Entity Preservation Score, to more accurately reflect system performance.
Evaluating Indian Language ASR
A practical guide to layered Indic ASR evaluation: LLM-WER and LLM-CER, Intent and Entity scores, COMET, and open-source evaluation frameworks.
Research
April 2, 2026·14 min read
Introduction
Measuring how well a speech recognition system performs in Indian languages is harder than it looks. The standard metrics weren't built for them, and that mismatch quietly distorts how Indic ASR systems get evaluated.
Word Error Rate (WER), Character Error Rate (CER), and BLEU were developed primarily for English. They work well when every word has a single accepted spelling, when languages don't mix mid-sentence, and when the gap between formal and colloquial usage is narrow. Indian languages don't fit that description. Colloquial and formal registers coexist and are equally understood by speakers. English loanwords appear in both Indic and Latin script, sometimes within the same utterance. Numbers have multiple valid written forms. Applying these metrics without adjustment can make an Indic ASR system look significantly worse, or occasionally better, than it actually performs in practice.
This is a harder problem than it first appears. It isn't just that the metrics are imperfect. The deeper issue is that WER and CER penalize surface-level differences in character sequences without any understanding of whether two transcriptions mean the same thing. When a model correctly transcribes a spoken word but renders it in a different but equally valid script or spelling, the metric counts that as an error. The transcript is right. The score is wrong.
This blog describes a layered evaluation approach that addresses these gaps directly. It explains what WER and CER measure and where they break down, then introduces four LLM-based metrics - LLM-WER, LLM-CER, Intent Score, and Entity Preservation Score. Together, they give a more accurate and complete picture of how an Indic ASR system is actually performing. The blog also introduces two open-source evaluation frameworks you can drop into an existing pipeline.
Throughout, we draw on examples from Saaras V3, Sarvam's speech recognition API for 22 Indian languages. Saaras V3 supports five output modes - transcription, translation, verbatim output, transliteration, and code-mix - which makes it a useful concrete anchor for the broader evaluation principles discussed. The open-source frameworks below can be adapted to evaluate any Indic ASR system.
llm_wer
llm_intent_entity
We don't think this approach is the final word on Indic ASR evaluation. The field is still developing, and the right set of metrics will likely evolve as the systems themselves do. But we believe this layered framework is meaningfully closer to what evaluation for Indian languages actually requires, and the tools exist today to start using it.
Working Example: Saaras V3 by Sarvam
This section covers Saaras V3's output modes and delivery options. The metric discussions that follow reference these modes directly, so it helps to have them in view before diving in. If you're evaluating a different ASR system, the same principles apply: substitute your system's equivalent modes and endpoints where relevant.
Output Modes
To make the differences between modes concrete, all five examples below are drawn from the same spoken Hindi sentence:
Input sentence (spoken Hindi): मुझे कल सुबह नौ बजे doctor के पास जाना है
Note the code-mixed English word 'doctor' and the spoken number 'नौ'.
| Mode | What it returns | Example output from the sentence above | Primary evaluation metric |
| --- | --- | --- | --- |
| Transcribe | Normalised transcript with numbers, punctuation, and formatting | मुझे कल सुबह 9 बजे डॉक्टर के पास जाना है। | LLM-WER / LLM-CER |
| Translate | English translation of the spoken input | I need to go to the doctor tomorrow morning at 9. | Intent Score + Entity Score + COMET |
| Verbatim | Exact word-for-word output, no normalisation | मुझे कल सुबह नौ बजे डॉक्टर के पास जाना है | Standard WER (strict) |
| Translit | Indic script converted to the Latin alphabet | Mujhe kal subah nau baje doctor ke paas jaana hai. | LLM-CER |
| Codemix | Mixed output preserving Indic and English tokens | मुझे कल सुबह 9 बजे doctor के पास जाना है। | LLM-WER + Entity Score |

Notes on Transcribe: 'नौ' → '9' (spoken number normalised to digit); 'doctor' → 'डॉक्टर' (loanword transliterated to Devanagari); full stop added.
Notes on Codemix: 'नौ' → '9' (number normalised); 'doctor' stays in Latin script rather than being transliterated; the code-mixed nature of the speech is preserved.
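An evaluation harness needs to know which metrics to run for each output mode. The sketch below encodes the pairing from the table above as a lookup. The lower-case mode keys and the metric labels are assumptions made for illustration; only mode="verbatim" is spelled out in this article, so check the identifiers against your ASR system's documentation.

```python
from typing import Dict, List

# Mode -> primary metrics, copied from the Output Modes table.
# Keys and metric labels are illustrative names, not confirmed API identifiers.
PRIMARY_METRICS: Dict[str, List[str]] = {
    "transcribe": ["llm_wer", "llm_cer"],
    "translate": ["intent_score", "entity_score", "comet"],
    "verbatim": ["wer"],
    "translit": ["llm_cer"],
    "codemix": ["llm_wer", "entity_score"],
}


def metrics_for(mode: str) -> List[str]:
    """Return the primary metrics to compute for a given output mode."""
    key = mode.strip().lower()
    if key not in PRIMARY_METRICS:
        raise ValueError(f"unknown output mode: {mode!r}")
    # Return a copy so callers cannot mutate the shared table.
    return list(PRIMARY_METRICS[key])


print(metrics_for("verbatim"))
```

Keeping the mapping in one place makes it harder to accidentally gate a normalised mode such as Transcribe on strict WER.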
API Delivery Options
Saaras V3 is available through three delivery methods. The right choice depends on your file length, latency requirements, and integration architecture.
| API | Best for | Limits | Response type | Latency |
| --- | --- | --- | --- | --- |
| REST API | Single short files, webhook integrations, synchronous pipelines | Up to 30 seconds per file | Synchronous; result returned in the same HTTP response | 2-5 seconds |
| Batch API | Long recordings, bulk jobs, overnight pipelines, call centre archives | Up to 60 minutes per file; up to 20 files per request | Asynchronous; returns a job ID; poll for results | Minutes to hours depending on queue |
| WebSocket Streaming | Real-time voice assistants, live captions, interactive conversational bots | Continuous audio stream; no fixed file size limit | Real-time partial results as audio arrives | Sub-second to first word |
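The choice between the three delivery methods can be reduced to a small routing rule. The sketch below uses only the limits listed in the table (30 seconds for REST, 60 minutes for Batch). The function name and the returned labels "rest", "batch", and "websocket" are hypothetical and are not identifiers from the Saaras V3 API.

```python
def choose_delivery(duration_s: float, realtime: bool = False) -> str:
    """Pick a delivery method from audio duration and latency needs."""
    if realtime:
        # Live audio has no fixed length, so streaming is the only fit.
        return "websocket"
    if duration_s <= 30:
        # Short files fit the synchronous REST limit.
        return "rest"
    if duration_s <= 60 * 60:
        # Longer files go to the asynchronous batch queue.
        return "batch"
    raise ValueError("recording exceeds 60 minutes; split it before submitting")


print(choose_delivery(12.5))
print(choose_delivery(900))
print(choose_delivery(0, realtime=True))
```

For evaluation runs over a test set, this kind of rule keeps short utterances on the low-latency path while routing long call recordings to the batch queue.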
Metrics Overview
An overview of the metrics covered in this blog, organised by category.
Traditional Metrics: String-matching and n-gram overlap, designed primarily for English.
WER / CER (Section 3): Word/Character Error Rate. Edit-distance metric; fast, well-understood, and the standard benchmark baseline.
COMET (Section 6): Neural translation metric trained on human judgements. Better than BLEU for measuring translation fluency.
BLEU (Section 7.2): N-gram overlap for translation quality. Included as context; superseded by COMET for modern use.
Gen-AI Metrics: LLM-based metrics that measure meaning, not just character distance.
LLM-WER / LLM-CER (Section 5): WER/CER rescored by an LLM judge. Segments that are semantically or phonetically equivalent are no longer counted as errors.
Intent Score (Section 7): Binary score (0 or 1). An LLM judges whether the core meaning of the utterance is preserved.
Entity Preservation Score (Section 8): Float between 0 and 1. The fraction of named entities (names, places, numbers, dates) that appear correctly in the transcription.
Indian languages allow multiple valid written forms for the same spoken word: colloquial and formal registers, code-mixed English loanwords, and numbers written in digits or spelled out. Traditional metrics treat all of these variations as errors. Section 4 walks through six concrete failure classes, and the Gen-AI metrics in Sections 5 to 8 are designed to handle each of them.
Metric 1: Standard WER and CER
WER and CER are the traditional baselines for ASR evaluation. They are well-understood, fast to compute and genuinely useful in specific circumstances.
Definitions
Word Error Rate (WER) measures the edit distance between the ASR output and a reference transcript at the word level. It counts substitutions (S), deletions (D), and insertions (I), and divides their sum by the total number of words in the reference (N):
WER = (Substitutions + Deletions + Insertions) / Total Reference Words
Word Error Rate (conceptual formula)
Character Error Rate (CER) applies the same formula at the character level. It is preferred over WER for agglutinative languages (Malayalam, Kannada, Telugu) where a single word token can be very long, making word-level edit distance disproportionately punishing.
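To make the two definitions concrete, here is a minimal pure-Python sketch of WER and CER built on Levenshtein distance. It makes two simplifying assumptions: words are split on whitespace, and CER counts Unicode code points (spaces included) rather than grapheme clusters, so toolkits with different tokenisation may report slightly different numbers. The example pair is the spoken sentence and the Transcribe output from the Output Modes table.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Minimum number of substitutions, deletions, and insertions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference token
                curr[j - 1] + 1,         # insertion of a hypothesis token
                prev[j - 1] + (r != h),  # match (0) or substitution (1)
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Same formula applied to Unicode code points instead of words."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


ref = "मुझे कल सुबह नौ बजे doctor के पास जाना है"
hyp = "मुझे कल सुबह 9 बजे डॉक्टर के पास जाना है।"

# Three of the ten reference words differ (नौ/9, doctor/डॉक्टर, है/है।).
print(f"WER = {wer(ref, hyp):.2f}")
print(f"CER = {cer(ref, hyp):.2f}")
```

All three flagged words are formatting choices, not recognition mistakes, which is the mismatch the rest of this article examines.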
When WER and CER Are Still Useful
mode="verbatim": when exact transcription is the requirement, WER is the correct strict measure.
Benchmarking against published numbers on datasets like Vistaar or IndicVoices, which report raw WER.
As a complementary baseline alongside LLM-based metrics. Always report both.
A note on evaluation strategy: For Indian languages, WER and CER work best as one layer in a multi-metric strategy, not as a standalone quality gate. Section 4 walks through the specific scenarios where LLM-based metrics give a more accurate picture.
Understanding the Limits of Standard Metrics for Indian Languages
WER and CER were designed for English, which has a near one-to-one relationship between spoken words and written tokens, fixed spelling conventions, and no code-mixing in standard speech. Indian languages have more flexibility built in. That flexibility is a feature, not a problem. The six scenarios below illustrate where standard metrics misread a correct transcription as an error, and why LLM-based metrics give a fairer read.
Colloquial Variants: When 'Wrong' Is Right
Every Indian language has a formal written register and a colloquial spoken register. Native speakers use both interchangeably and understand both perfectly. WER treats any deviation from the reference form as an error, regardless of whether meaning is preserved.
Here is a concrete walkthrough for Tamil: how the reference, ASR output, metric verdicts, and real-world impact line up.
What the speaker said - A casual, colloquial Tamil sentence spoken naturally
Reference (formal) - அவர்கள் ஒன்றாக வேலை செய்கிறார்கள் (avargal: "they work together")
Example ASR output - அவுங்க ஒண்ணா வேலை செய்றாங்க (avunga: colloquial form, identical meaning)
Standard WER verdict - ❌ 3 out of 4 words flagged as errors (only வேலை matches). WER = 75%.
Reality - ✅ A native Tamil speaker hears this as a perfect transcription.
Business impact - If you set a WER threshold of 20% for your Tamil voice bot, you will reject a high-quality ASR output and may never ship.
Code-Mixing: The Script Mismatch Trap
Hundreds of millions of Indians code-mix naturally, switching between their native language and English mid-sentence. When the ASR and the reference annotator make different but equally valid choices about how to write an English word, WER registers it as an error.
Here is a concrete walkthrough for Hindi (code-mixed with English): how the reference, ASR output, metric verdicts, and real-world impact line up.
Reference - वह doctor के पास गया (English word kept in Latin script)
Example ASR output - वह डॉक्टर के पास गया (same word transliterated to Devanagari)
Standard WER verdict - ❌ Flags 'doctor' as a substitution. WER = 20% on this sentence.
Reality - ✅ Identical meaning. Both spellings are correct.
Business impact - A customer service bot for a bank handles code-mixed Hindi all day. Every loanword ('account', 'balance', 'transfer', 'nominee') is a potential false WER error. The model looks 15% worse than it is.
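The script mismatch can be illustrated by scoring the same pair twice: once on raw tokens, and once after mapping known loanword spellings to a single canonical form. The lookup table below is a hand-written stand-in for the equivalence judgement an LLM judge would make. It is an assumption for illustration only, not the LLM-WER implementation, and the names EQUIVALENTS and canonical are hypothetical.

```python
from typing import Dict, List, Sequence

# Hypothetical equivalence table: Latin-script loanword -> Devanagari form.
EQUIVALENTS: Dict[str, str] = {
    "doctor": "डॉक्टर",
    "account": "अकाउंट",
}


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Word-level Levenshtein distance."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = curr
    return prev[-1]


def canonical(tokens: List[str]) -> List[str]:
    """Map each known loanword spelling to one canonical script form."""
    return [EQUIVALENTS.get(t.lower(), t) for t in tokens]


def wer(ref: List[str], hyp: List[str]) -> float:
    return edit_distance(ref, hyp) / len(ref)


ref = "वह doctor के पास गया".split()
hyp = "वह डॉक्टर के पास गया".split()

raw = wer(ref, hyp)                               # script mismatch counted
normalised = wer(canonical(ref), canonical(hyp))  # mismatch forgiven
print(f"raw WER = {raw:.2f}, normalised WER = {normalised:.2f}")
```

A static table only covers the loanwords someone thought to list, which is one reason to hand the equivalence decision to a model instead.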
Short Helper Words: Exponential Penalties
Hindi, Bengali, and Marathi rely on short helper words, typically 2 to 3 characters. WER's division-by-word-count formula produces skewed scores when these words have minor deviations.
Here is a concrete walkthrough for Hindi: how the reference, ASR output, metric verdicts, and real-world impact line up.
Reference - नहीं ("no": 2 characters)
Example ASR output (echo repeat) - नहीं नहीं (word echoed once due to audio processing)
Standard WER verdict - ❌ WER = 100% (one insertion against a one-word reference). The model appears catastrophically wrong.
Second reference - है ("is": a 2-character helper word)
Example ASR output (diacritic drift) - हई (phonetically near-identical, minor diacritic shift)
Standard WER verdict - ❌ WER = 100% on this word; a complete failure for a single diacritic.
Reality - ✅ A Hindi speaker hears no meaningful difference in either case.
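How much these short utterances distort a reported number depends on how scores are aggregated. The sketch below compares averaging per-utterance WER with pooling errors and reference words across the whole test set. The counts are hypothetical: two one-word utterances with one error each (as in the है example) and one clean ten-word utterance.

```python
from typing import List, Tuple

Counts = Tuple[int, int]  # (edit errors, reference word count)


def mean_utterance_wer(counts: List[Counts]) -> float:
    """Average of per-utterance WER; every utterance weighs the same."""
    return sum(errors / words for errors, words in counts) / len(counts)


def corpus_wer(counts: List[Counts]) -> float:
    """Total errors over total reference words; long utterances weigh more."""
    total_errors = sum(errors for errors, _ in counts)
    total_words = sum(words for _, words in counts)
    return total_errors / total_words


# Hypothetical test set: two short utterances with one error each,
# plus one ten-word utterance transcribed without errors.
counts = [(1, 1), (1, 1), (0, 10)]

print(f"mean per-utterance WER = {mean_utterance_wer(counts):.3f}")
print(f"corpus-level WER       = {corpus_wer(counts):.3f}")
```

The same two errors yield very different headline figures, so it is worth stating which aggregation a reported WER uses, especially for test sets dominated by short replies.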
Agglutinative Languages: Suffix Substitutions
Malayalam, Kannada, and Telugu build long compound words by chaining morphemes. A minor suffix substitution, grammatically trivial, creates a large CER penalty because the entire token is affected.
Here is a concrete walkthrough for Malayalam: how the reference, ASR output, metric verdicts, and real-world impact line up.
Reference - വിദ്യാർത്ഥികളുമായി സംസാരിച്ചു ("spoke with the students"; suffix: ഉമായി)
Example ASR output - വിദ്യാർത്ഥികളൊടു സംസാരിച്ചു (same meaning, grammat
[truncated for AI cost control]