QIMMA قِمّة ⛰: A Quality-First Arabic LLM Leaderboard
QIMMA (Arabic for 'summit') is a quality-first Arabic LLM leaderboard that validates benchmarks before evaluation, revealing systematic quality issues in widely-used Arabic benchmarks. It consolidates 109 subsets from 14 benchmarks across 7 domains, applies multi-model automated assessment and human review, and ranks models with a focus on native Arabic capability. The leaderboard is the first to include code evaluation for Arabic LLMs.
Community Article Published April 21, 2026
Leen AlQadi, Ahmed Alzubaidi, Mohammed Alyafeai, Maitha Alhammadi, Shaikha Alsuwaidi, Omar Saif Alkaabi, Basma Boussaha, Hakim Hacid (tiiuae)
QIMMA validates benchmarks before evaluating models, ensuring reported scores reflect genuine Arabic language capability in LLMs.
If you've been tracking Arabic LLM evaluation, you've probably noticed a growing tension: the number of benchmarks and leaderboards is expanding rapidly, but are we actually measuring what we think we're measuring?
We built QIMMA قمّة (Arabic for "summit") to answer that question systematically. Instead of aggregating existing Arabic benchmarks as-is and running models on them, we applied a rigorous quality validation pipeline before any evaluation took place. What we found was sobering: even widely used, well-regarded Arabic benchmarks contain systematic quality issues that can quietly corrupt evaluation results.
This post walks through what QIMMA is, how we built it, what problems we found, and what the model rankings look like once you clean things up.
🔍 The Problem: Arabic NLP Evaluation Is Fragmented and Unvalidated
Arabic is spoken by over 400 million people across diverse dialects and cultural contexts, yet the Arabic NLP evaluation landscape remains fragmented. A few key pain points have motivated this work:
Translation issues. Many Arabic benchmarks are translations from English. This introduces distributional shifts. Questions that feel natural in English become awkward or culturally misaligned in Arabic, making benchmark data less representative of how Arabic is naturally used.
Absent quality validation. Even native Arabic benchmarks are often released without rigorous quality checks. Annotation inconsistencies, incorrect gold answers, encoding errors, and cultural bias in ground-truth labels have all been documented in established resources.
Reproducibility gaps. Evaluation scripts and per-sample outputs are rarely released publicly, making it hard to audit results or build on prior work.
Coverage fragmentation. Existing leaderboards cover isolated tasks and narrow domains, making holistic model assessment difficult.
To illustrate where QIMMA sits relative to existing platforms:
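| Leaderboard | Open Source | Native Arabic | Quality Validation | Coding Eval | Public Outputs |
|---|---|---|---|---|---|
| OALL v1 | ✅ | Mixed | ❌ | ❌ | ✅ |
| OALL v2 | ✅ | Mostly | ❌ | ❌ | ✅ |
| BALSAM | Partial | 50% | ❌ | ❌ | ❌ |
| AraGen | ✅ | 100% | ✅ | ❌ | ❌ |
| SILMA ABL | ✅ | 100% | ✅ | ❌ | ✅ |
| ILMAAM | Partial | 100% | ✅ | ❌ | ❌ |
| HELM Arabic | ✅ | Mixed | ❌ | ❌ | ✅ |
| ⛰ QIMMA | ✅ | 99% | ✅ | ✅ | ✅ |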
QIMMA is the only platform combining all five properties: open source, predominantly native Arabic content, systematic quality validation, code evaluation, and public per-sample inference outputs.
⛰ What's in QIMMA?
QIMMA consolidates 109 subsets from 14 source benchmarks into a unified evaluation suite of over 52,000 samples, spanning 7 domains:
| Domain | Benchmarks | Task Types |
|---|---|---|
| Cultural | AraDiCE-Culture, ArabCulture, PalmX | MCQ |
| STEM | ArabicMMLU, GAT, 3LM STEM | MCQ |
| Legal | ArabLegalQA, MizanQA | MCQ, QA |
| Medical | MedArabiQ, MedAraBench | MCQ, QA |
| Safety | AraTrust | MCQ |
| Poetry & Literature | FannOrFlop | QA |
| Coding | 3LM HumanEval+, 3LM MBPP+ | Code |
A few things stand out about this design:
99% native Arabic content. The only exception is code evaluation, which is inherently language-agnostic.
First Arabic leaderboard with code evaluation. QIMMA integrates Arabic-adapted versions of HumanEval+ and MBPP+, making it possible to assess coding capability with Arabic-language problem statements.
Diversity in domains and tasks. QIMMA evaluates real-world competency areas including education, governance, healthcare, creative expression, and software development.
🔬 The Quality Validation Pipeline
This is the methodological heart of QIMMA. Before running a single model, we applied a multi-stage validation pipeline to every sample in every benchmark.
Stage 1: Multi-Model Automated Assessment
Each sample was independently evaluated by two state-of-the-art LLMs:
Qwen3-235B-A22B-Instruct
DeepSeek-V3-671B
We chose two models with strong Arabic capability but different training data compositions, so that their combined judgment is more robust than either alone.
Each model scores a sample against a 10-point rubric, assigning a binary score (0 or 1) to each of ten criteria.
A sample is flagged for elimination if a model scores it below 7/10. Samples flagged by both models are dropped immediately. Samples flagged by only one model proceed to human review in Stage 2.
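The triage rule described above can be sketched in a few lines. This is a minimal illustration, not the QIMMA implementation: the names `JudgeScores`, `triage`, and `PASS_THRESHOLD` are hypothetical, and the rubric criteria themselves are not reproduced here.

```python
from dataclasses import dataclass

# A judge flags a sample when its rubric total falls below 7 out of 10.
PASS_THRESHOLD = 7


@dataclass
class JudgeScores:
    """Binary (0 or 1) scores from one judge model, one per rubric criterion."""
    criteria: list[int]

    def total(self) -> int:
        if len(self.criteria) != 10 or any(c not in (0, 1) for c in self.criteria):
            raise ValueError("expected ten binary criterion scores")
        return sum(self.criteria)


def triage(judge_a: JudgeScores, judge_b: JudgeScores) -> str:
    """Return 'drop', 'human_review', or 'keep' for one sample."""
    flags = [judge.total() < PASS_THRESHOLD for judge in (judge_a, judge_b)]
    if all(flags):
        return "drop"          # both judges agree the sample is bad
    if any(flags):
        return "human_review"  # judges disagree, so a person decides
    return "keep"


if __name__ == "__main__":
    clean = JudgeScores([1] * 10)
    weak = JudgeScores([1] * 6 + [0] * 4)
    print(triage(clean, clean), triage(clean, weak), triage(weak, weak))
```

Routing disagreements to people rather than dropping them keeps a single judge's blind spot from silently removing valid samples.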
Stage 2: Human Annotation and Review
Flagged samples are reviewed by native Arabic speakers with cultural and dialectal familiarity. Human annotators make final calls on:
Cultural context and regional variation
Dialectal nuance
Subjective interpretation
Subtle quality issues automated assessment may miss
For culturally sensitive content, multiple perspectives are considered, since "correctness" can genuinely vary across Arab regions.
⚠️ What We Found: Systematic Quality Problems
The pipeline revealed recurring quality issues across benchmarks: not isolated errors, but systematic patterns reflecting gaps in how the benchmarks were originally constructed.
By the Numbers
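| Benchmark | Total Samples | Discarded | Discard Rate |
|---|---|---|---|
| ArabicMMLU | 14,163 | 436 | 3.1% |
| MizanQA | 1,769 | 41 | 2.3% |
| PalmX | 3,001 | 25 | 0.8% |
| MedAraBench | 4,960 | 33 | 0.7% |
| FannOrFlop | 6,984 | 43 | 0.6% |
| ArabCulture | 3,482 | 7 | 0.2% |
| MedArabiQ | 499 | 1 | 0.2% |
| GAT | 13,986 | 1 | ~0.0% |
| 3LM STEM | 2,609 | 1 | ~0.0% |
| AraDiCE-Culture | 180 | 0 | 0.0% |
| ArabLegalQA | 79 | 0 | 0.0% |
| AraTrust | 522 | 0 | 0.0% |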
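As a quick sanity check, the discard rates can be recomputed from the raw counts. The snippet below is only an illustration using the figures from the table above; the names `COUNTS` and `discard_rate` are ours.

```python
# (total samples, discarded samples) per benchmark, copied from the table above
COUNTS = {
    "ArabicMMLU": (14163, 436),
    "MizanQA": (1769, 41),
    "PalmX": (3001, 25),
    "MedAraBench": (4960, 33),
    "FannOrFlop": (6984, 43),
    "ArabCulture": (3482, 7),
    "MedArabiQ": (499, 1),
    "GAT": (13986, 1),
    "3LM STEM": (2609, 1),
    "AraDiCE-Culture": (180, 0),
    "ArabLegalQA": (79, 0),
    "AraTrust": (522, 0),
}


def discard_rate(total: int, discarded: int) -> float:
    """Discarded samples as a percentage of the benchmark's total."""
    return 100.0 * discarded / total


total_samples = sum(total for total, _ in COUNTS.values())
total_discarded = sum(discarded for _, discarded in COUNTS.values())
overall_rate = discard_rate(total_samples, total_discarded)

if __name__ == "__main__":
    for name, (total, discarded) in COUNTS.items():
        print(f"{name:16s} {discard_rate(total, discarded):.1f}%")
    # overall: 588 of 52234 samples, about 1.1%
    print(f"overall          {overall_rate:.1f}%")
```

Across these twelve benchmarks the overall discard rate is roughly 1.1%, with ArabicMMLU alone accounting for about three quarters of the discarded samples.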
Taxonomy of Issues Found
⚖️ Answer Quality
False or mismatched gold indices, factually wrong answers, missing or raw text answers.
📄 Text & Formatting Quality
Corrupt or illegible text, spelling and grammar errors, and duplicate samples.
💬 Cultural Sensitivity
Stereotype reinforcement and monolithic generalizations about diverse communities.
🤝 Gold Answer Compliance
Misalignment of gold answers with evaluation protocols.
💻 Code Benchmark: A Different Kind of Quality Work
Code benchmarks required a different intervention. Rather than discarding samples, we refined the Arabic problem statements in 3LM's Arabic adaptations of HumanEval+ and MBPP+, leaving task identifiers, reference solutions, and test suites completely unchanged.
The modification rates were striking:
| Benchmark | Total Prompts | Modified | Unchanged | Modification Rate |
|---|---|---|---|---|
| 3LM HumanEval+ | 164 | 145 | 19 | 88% |
| 3LM MBPP+ | 378 | 308 | 70 | 81% |
Modifications fell into five categories:
Linguistic refinement: normalizing toward natural Modern Standard Arabic and consistent imperative style
Clarity improvements: fixing ambiguous instructions and unclear constraints
Consistency normalization: standardizing mathematical terminology, punctuation, and example formatting
Structural corrections: fixing broken triple-quoted strings, indentation errors, and corrupted text fragments
Semantic refinements: clarifying whether ranges are inclusive or exclusive while preserving task intent
⚙️ Evaluation Setup
Evaluation Framework
QIMMA runs its evaluations with LightEval, EvalPlus, and FannOrFlop, chosen for consistency, multilingual community adoption, and reproducibility.
Metrics by Task Type
| Task Type | Metric | Benchmarks |
|---|---|---|
| MCQ | Normalized Log-Likelihood Accuracy | AraDiCE-Culture, ArabicMMLU, ArabCulture, PalmX, 3LM STEM, MedArabiQ, GAT, MedAraBench, AraTrust |
| Multi-select MCQ | Probability Mass on Gold Choices | MizanQA |
| Generative QA | F1 BERTScore (AraBERT v02) | MedArabiQ, ArabLegalQA, FannOrFlop |
| Code | Pass@1 | 3LM HumanEval+, 3LM MBPP+ |
Prompt Templates
QIMMA standardizes prompting by question format, with six template types:
MCQ: generic multiple choice
MCQ-C: multiple choice with context passage
MCQ-I: multiple choice with specific instructions (GAT analogy/completion)
QA: generic open-ended QA
QA-C: QA with context
QA-F: fill-in-the-blank QA
All prompts are in Arabic. For MizanQA and ArabCulture, benchmark-specific system prompts from the original papers are preserved.
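One way to picture the six template types is as a small routing function. The sketch below is purely illustrative: `select_template` and its flags are hypothetical, the precedence between flags is our assumption, and the Arabic template texts are not reproduced.

```python
def select_template(
    has_choices: bool,
    has_context: bool = False,
    has_instructions: bool = False,
    fill_in_blank: bool = False,
) -> str:
    """Map a sample's format to one of the six template type names.

    Assumption: task-specific instructions take precedence over a context
    passage for MCQ, and fill-in-the-blank takes precedence for QA.
    """
    if has_choices:
        if has_instructions:
            return "MCQ-I"
        if has_context:
            return "MCQ-C"
        return "MCQ"
    if fill_in_blank:
        return "QA-F"
    if has_context:
        return "QA-C"
    return "QA"


if __name__ == "__main__":
    print(select_template(has_choices=True, has_instructions=True))  # GAT-style item
    print(select_template(has_choices=False, has_context=True))      # QA with a passage
```

Keying the template on question format rather than on the source benchmark is what lets samples from different benchmarks share one prompt style.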
🏆 Leaderboard Results
Results as of April 2026, covering the top 10 evaluated models. Visit the live leaderboard for current rankings.
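| Rank | Model | AVERAGE | AraDiCE-Culture | ArabicMMLU | ArabCulture | PalmX | 3LM STEM | AraTrust | MizanQA | MedArabiQ | ArabLegalQA | GAT | MedAraBench | HumanEval+ | MBPP+ | FannOrFlop |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 🥇 1 | Qwen/Qwen3.5-397B-A17B-FP8 | 68.06 | 82.78 | 77.54 | 61.75 | 83.91 | 88.67 | 90.04 | 73.36 | 47.30 | 54.94 | 55.89 | 47.97 | 67.68 | 76.72 | 44.33 |
| 🥈 2 | Applied-Innovation-Center/Karnak | 66.20 | 73.33 | 80.94 | 53.49 | 81.40 | 93.10 | 89.08 | 55.92 | 55.78 | 71.58 | 61.06 | 54.19 | 33.54 | 64.55 | 58.91 |
| 🥉 3 | inceptionai/Jais-2-70B-Chat | 65.81 | 78.89 | 81.29 | 83.24 | 83.73 | 87.96 | 90.23 | 71.78 | 52.79 | 69.60 | 51.67 | 50.89 | 19.51 | 43.65 | 56.13 |
| #4 | Qwen/Qwen2.5-72B-Instruct | 65.75 | 77.22 | 73.78 | 63.83 | 77.77 | 87.55 | 88.51 | 63.49 | 50.06 | 70.74 | 55.90 | 44.19 | 37.20 | 72.75 | 57.51 |
| #5 | Applied-Innovation-Center/AIC-1 | 65.37 | 73.33 | 72.02 | 77.52 | 76.11 | 88.13 | 90.61 | 56.36 | 53.75 | 68.96 | 62.11 | 50.78 | 28.05 | 69.58 | 47.83 |
| #6 | Qwen/Qwen3.5-122B-A10B | 64.84 | 74.44 | 73.17 | 37.78 | 81.46 | 86.18 | 86.97 | 64.01 | 47.04 | 55.11 | 50.90 | 52.49 | 65.24 | 72.43 | 60.54 |
| #7 | Sakalti/Ultiima-72B | 64.49 | 78.33 | 72.28 | 68.79 | 76.75 | 83.70 | 89.08 | 60.44 | 44.58 | 69.12 | 46.91 | 42.25 | 39.02 | 74.07 | 57.56 |
| #8 | meta-llama/Llama-3.3-70B-Instruct | 63.96 | 77.22 | 71.57 | 78.05 | 77.95 | 88.28 | 85.63 | 67.44 | 56.25 | 64.00 | 51.13 | 54.86 | 27.44 | 71.16 | 24.43 |
| #9 | Qwen/Qwen2.5-32B-Instruct | 63.26 | 70.56 | 68.76 | 75.80 | 72.07 | 81.03 | 85.82 | 53.78 | 48.08 | 69.27 | 56.94 | 36.51 | 34.15 | 72.75 | 93.10 |
| #10 | FreedomIntelligence/AceGPT-v2-32B-Chat | 61.14 | 76.67 | 70.62 | 79.79 | 74.46 | 84.88 | 86.97 | 63.89 | 49.96 | 71.46 | 56.04 | 47.32 | 23.78 | 54.50 | 15.56 |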
Scale does not guarantee best performance. The top 10 spans models from 32B to 397B parameters, with several mid-size models outperforming larger ones on specific domains.
Arabic-specialized models lead on cultural and linguistic tasks. Jais-2-70B-Chat ranks highest on ArabicMMLU and ArabCulture, while Karnak leads on 3LM STEM and ArabLegalQA.
Coding remains the hardest domain for Arabic-specialized models. The top HumanEval+ and MBPP+ scores belong to multilingual models, with Qwen3.5-397B leading both.
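The post does not state how the AVERAGE column is computed, but the top row is consistent with an unweighted mean of the 14 per-benchmark scores. The snippet below checks that reading against the Qwen3.5-397B row; treat the formula as an assumption rather than the documented aggregation.

```python
from statistics import fmean

# Per-benchmark scores for Qwen/Qwen3.5-397B-A17B-FP8, copied from the table above
qwen35_397b = {
    "AraDiCE-Culture": 82.78,
    "ArabicMMLU": 77.54,
    "ArabCulture": 61.75,
    "PalmX": 83.91,
    "3LM STEM": 88.67,
    "AraTrust": 90.04,
    "MizanQA": 73.36,
    "MedArabiQ": 47.30,
    "ArabLegalQA": 54.94,
    "GAT": 55.89,
    "MedAraBench": 47.97,
    "HumanEval+": 67.68,
    "MBPP+": 76.72,
    "FannOrFlop": 44.33,
}

# Assumed aggregation: every benchmark counts equally, regardless of size.
average = round(fmean(qwen35_397b.values()), 2)

if __name__ == "__main__":
    print(average)  # 68.06, matching the AVERAGE column for this row
```

Under this reading, small benchmarks such as ArabLegalQA weigh as much as large ones such as ArabicMMLU, so a model's rank is sensitive to its weakest domains (coding, for several Arabic-specialized models).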
The Size-Performance Relationship
Across the full leaderboard (46 models), a clear but imperfect size-performance correlation emerges. However, there are interesting exceptions:
Arabic-specialized models often outperform size-matched multilingual models
Instruction-tuned models outperform their base counterparts, with Qwen3 as the exception
Some smaller Arabic-specialized models (Fanar-1-9B, ALLaM-7B) outperform much larger multilingual models on specific domains
🌟 What Makes QIMMA Different
To summarize the distinctive properties of QIMMA:
| Property | Details |
|---|---|
| Quality-first philosophy | Validation runs before evaluation, not as an afterthought |
| Multi-model validation | Two LLMs with different training + human review for flagged cases |
| 99% native Arabic | Avoids translation artifacts almost entirely |
| Multi-domain, multi-task | 7 domains, 3 task types (MCQ, QA, code), 109 subsets |
| Code evaluation | First Arabic leaderboard to include code generation |
| Full transparency | Per-sample inference outputs publicly released, not just aggregate scores |
| LightEval-based | Unified, reproducible evaluation codebase |
| Dialectal awareness | Explicit handling of MSA vs. dialectal variation in prompts and rubrics |
🔗 Resources
🏆 Leaderboard: QIMMA Leaderboard
💻 Code: GitHub
📄 Paper: Are Arabic Benchmarks Reliable? QIMMA's Quality-First Approach to LLM Evaluation
🔖 Citation
@misc{alqadi2026arabicbenchmarksreliableqimmas,
  title={Are Arabic Benchmarks Reliable? QIMMA's Quality-First Approach to LLM Evaluation},
  author={Leen AlQadi and Ahmed Alzubaidi and Mohammed Alyafeai and Hamza Alobeidli and Maitha Alhammadi and Shaikha Alsuwaidi and Omar Alkaabi and Basma El Amel Boussaha and Hakim Hacid},
  year={2026},
  eprint={2604.03395},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2604.03395},
}