Frontier Language Model Intelligence, over Time
Artificial Analysis tracks the intelligence of leading AI models over time with its independent Intelligence Index. The index includes 10 evaluations covering reasoning, coding, knowledge, and more, helping users choose the best model for their needs.
Artificial Analysis
Independent analysis of AI
Understand the AI landscape to choose the best model and provider for your use case
Launch
AA-AgentPerf
The first agentic inference benchmark
Launch
Coding Agent Benchmarks
Introducing the Artificial Analysis Coding Agent Benchmarks
Highlights
Intelligence
Artificial Analysis Intelligence Index · Higher is better
Speed
Output tokens per second · Higher is better
Price
USD per 1M tokens (blended) · Lower is better
Personalized model recommender
Get personalized recommendations based on your priorities for intelligence, speed, and cost
Explore agents for general work, coding, customer support, and more
Compare AI agents across capabilities, pricing, and platform support
Explore premium plans
Access expanded benchmark data, custom visualizations, industry reports, and more
Changelog
New language model evaluation · 11 Jun
HyperNova 60B 2605
New language model evaluation · 10 Jun
Gemma 4 12B (Non-reasoning)
New article published · 9 Jun
Claude Fable 5 Launches at #1 on the Artificial Analysis Intelligence Index
New article published · 9 Jun
Claude Fable 5: the first public Mythos-class model
New article published · 9 Jun
North Mini Code: Cohere's small coding-focused MoE model
New language model evaluation · 9 Jun
Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback)
New language model evaluation · 9 Jun
North Mini Code
New article published · 8 Jun
MiniMax-M3: Leading open weights model, once the weights are released
New language model evaluation · 8 Jun
LFM2.5-8B-A1B
New language model evaluation · 7 Jun
Gemma 4 12B (Reasoning)
New article published · 4 Jun
NVIDIA Nemotron 3 Ultra released: fast, intelligent, and open
New language model evaluation · 4 Jun
MiniCPM5-1B (Reasoning)
New language model evaluation · 4 Jun
Nemotron 3 Ultra 550B A55B (Reasoning)
New video model · 4 Jun
grok-imagine-video-1.5-preview
New article published · 3 Jun
Fun-Realtime-TTS: New Text to Speech model topping Artificial Analysis leaderboard
New language model evaluation · 3 Jun
MiniMax-M3
New article published · 2 Jun
MAI-Transcribe-1.5: New Speech to Text model leading the accuracy-speed Pareto frontier
New language model evaluation · 2 Jun
Qwen3.7 Plus
New article published · 1 Jun
AA-WER Streaming: New Speech to Text Streaming Benchmark
New language model evaluation · 1 Jun
Step 3.7 Flash
Intelligence
Intelligence of leading AI models based on our independent evaluations
Artificial Analysis Intelligence Index
Artificial Analysis Intelligence Index v4.0 incorporates 10 evaluations: GDPval-AA, 𝜏²-Bench Telecom, Terminal-Bench Hard, SciCode, AA-LCR, AA-Omniscience, IFBench, Humanity's Last Exam, GPQA Diamond, CritPt
Reasoning models are indicated by a lightbulb icon
Artificial Analysis Intelligence Index v4.0 includes: GDPval-AA, 𝜏²-Bench Telecom, Terminal-Bench Hard, SciCode, AA-LCR, AA-Omniscience, IFBench, Humanity's Last Exam, GPQA Diamond, CritPt. See Intelligence Index methodology for further details, including a breakdown of each evaluation and how we run them.
Artificial Analysis Intelligence Index by Open Weights / Proprietary
Indicates whether the model weights are available. Models are labelled as 'Commercial Use Restricted' if the weights are available but commercial use is limited (typically requires obtaining a paid license).
Intelligence vs. Cost to Run Artificial Analysis Intelligence Index
Artificial Analysis Intelligence Index · Cost to run Intelligence Index
Most attractive quadrant
The cost to run the evaluations in the Artificial Analysis Intelligence Index, calculated using the model's input and output token pricing and the number of tokens used across evaluations (excluding repeats).
Create custom visualizations
Create your own charts and tables comparing models and providers, save groups of models, and export data.
Go to Data Playground
Frontier Language Model Intelligence, Over Time
Artificial Analysis Coding Agent Index
Performance, cost, and execution time for leading coding agents on end-to-end software engineering tasks
Explore Artificial Analysis Coding Agent Index
Artificial Analysis Coding Agent Index
Composite average pass@1 across DeepSWE, Terminal-Bench v2, and SWE-Atlas-QnA · Higher is better
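As a rough illustration of how a composite like this could be computed, the sketch below takes per-benchmark pass@1 (the share of tasks solved on a single attempt) and averages the three scores. The equal weighting, the function names, and the task outcomes are assumptions for illustration only; this page does not state how the composite is weighted.

```python
from statistics import mean


def pass_at_1(outcomes):
    """Share of tasks solved on the first (and only) attempt."""
    return sum(outcomes) / len(outcomes)


def composite_pass_at_1(per_benchmark_outcomes):
    """Unweighted mean of per-benchmark pass@1 (equal weighting is an assumption)."""
    return mean(pass_at_1(o) for o in per_benchmark_outcomes.values())


# Made-up outcomes: True means the agent solved the task on its first attempt.
example = {
    "DeepSWE": [True, False, True, True],
    "Terminal-Bench v2": [True, False],
    "SWE-Atlas-QnA": [False, False, True, True],
}
print(round(composite_pass_at_1(example), 3))
```

Averaging per-benchmark scores, rather than pooling all tasks, keeps a benchmark with many tasks from dominating the composite.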
Image & Video Leaderboards
Top models from our Image Arena and Video Arena leaderboards, with 95% confidence intervals
Text to Image Leaderboard
Elo scores from blind preference votes in our Image Arena.
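For readers unfamiliar with Elo, the sketch below uses the standard Elo expected-score formula to convert a rating gap into an expected win rate in a head-to-head vote. It does not describe how Artificial Analysis fits its ratings or confidence intervals, which this page does not specify; the function name and ratings are made up for illustration.

```python
def expected_win_rate(rating_a, rating_b):
    """Standard Elo expected score for A against B (400-point logistic scale)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Equal ratings imply an even split of votes.
print(expected_win_rate(1100, 1100))
# A 100-point lead implies A is preferred somewhat more often than B.
print(round(expected_win_rate(1200, 1100), 3))
```

Only the rating difference matters under this formula, so two models 100 points apart have the same expected head-to-head result anywhere on the scale.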
Intelligence Breakdown
Intelligence Evaluations
Intelligence evaluations measured independently by Artificial Analysis · Higher is better
GDPval-AA
Agentic real-world work tasks, scored as (Elo - 500) / 2000
Terminal-Bench Hard
Agentic coding & terminal use
𝜏²-Bench Telecom
Agentic tool use
AA-LCR
Long context reasoning
AA-Omniscience Accuracy
Knowledge
AA-Omniscience Non-Hallucination Rate
1 - hallucination rate
Humanity's Last Exam
Reasoning & knowledge
GPQA Diamond
Scientific reasoning
SciCode
Coding
IFBench
Instruction following
CritPt
Physics reasoning
APEX-Agents-AA
Long-horizon agentic tasks
ITBench-AA
Kubernetes incident root-cause analysis
MMMU-Pro
Visual reasoning
While model intelligence generally translates across use cases, specific evaluations may be more relevant for certain use cases.
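Two entries in the list above are simple transformations of an underlying score: GDPval-AA is shown as (Elo-500)/2000, and the AA-Omniscience Non-Hallucination Rate is 1 minus the hallucination rate. A minimal sketch of both, with hypothetical function names and made-up input values:

```python
def gdpval_normalized(elo):
    """Rescale a GDPval-AA Elo score as (Elo - 500) / 2000, per the chart label."""
    return (elo - 500) / 2000


def non_hallucination_rate(hallucination_rate):
    """Complement of the hallucination rate, so that higher is better."""
    return 1 - hallucination_rate


print(gdpval_normalized(1500))       # an Elo of 1500 maps to 0.5
print(non_hallucination_rate(0.25))  # a 25% hallucination rate maps to 0.75
```

Both transformations put the scores on a "higher is better" axis alongside the other evaluations.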
AA-Omniscience
AA-Omniscience is a knowledge and hallucination benchmark that rewards accuracy, punishes bad guesses, and provides a comprehensive view of which models produce factually reliable outputs across different domains.
AA-Omniscience Index
AA-Omniscience Index (higher is better) measures knowledge reliability and hallucination. It rewards correct answers, penalizes hallucinations, and has no penalty for refusing to answer. Scores range from -100 to 100, where 0 means as many correct as incorrect answers, and negative scores mean more incorrect than correct.
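One formula consistent with this description is 100 * (correct - incorrect) / total, where refusals count toward the total but are neither rewarded nor penalized. The sketch below assumes that form; the exact definition belongs to the AA-Omniscience methodology rather than this page, and the function name and counts are illustrative.

```python
def omniscience_index(correct, incorrect, refused):
    """Assumed form: +1 per correct answer, -1 per incorrect answer, 0 per refusal,
    averaged over all questions and scaled to the range -100..100."""
    total = correct + incorrect + refused
    if total == 0:
        raise ValueError("at least one question is required")
    return 100.0 * (correct - incorrect) / total


# Same number of correct answers; the second model refuses instead of guessing wrong.
print(omniscience_index(correct=40, incorrect=60, refused=0))
print(omniscience_index(correct=40, incorrect=20, refused=40))
```

Under this assumed form, a model that declines to answer when unsure scores higher than one that guesses wrong on the same questions.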
GDPval-AA
GDPval-AA evaluates AI models on real-world, economically valuable tasks across a wide range of occupations
GDPval-AA Leaderboard
Elo scores for agentic performance on real-world work tasks using web and shell access via Stirrup, an open-source harness developed by Artificial Analysis
ITBench-AA
ITBench-AA evaluates AI agents on Kubernetes incident root-cause analysis from offline incident snapshots
ITBench-AA Average precision at full recall
Average precision at full recall on Kubernetes incident root-cause analysis from offline incident snapshots · For Site Reliability Engineering (SRE) tasks
Average precision at full recall on ITBench-AA, Artificial Analysis' implementation of IBM's ITBench benchmark for Kubernetes incident root-cause analysis from offline incident snapshots.
Artificial Analysis Openness Index
Artificial Analysis Openness Index assesses how 'open' models are on the basis of their availability and transparency across different components.
Artificial Analysis Openness Index: Components
Openness Index underlying score contribution by components, up to a maximum of 18 (higher is more open)
Artificial Analysis Openness Index vs. Artificial Analysis Intelligence Index
Most attractive quadrant
Output Tokens
Output tokens of leading AI models based on our independent evaluations
Output Tokens Used to Run Artificial Analysis Intelligence Index
Tokens used to run all evaluations in the Artificial Analysis Intelligence Index
The number of tokens required to run all evaluations in the Artificial Analysis Intelligence Index (excluding repeats).
Cost Efficiency
Cost of leading AI models based on our independent evaluations
Cost to Run Artificial Analysis Intelligence Index
Cost (USD) to run all evaluations in the Artificial Analysis Intelligence Index
The cost to run the evaluations in the Artificial Analysis Intelligence Index, calculated using the model's input and output token pricing and the number of tokens used across evaluations (excluding repeats).
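The calculation described above can be sketched as token counts multiplied by per-million-token prices. The function name, token counts, and prices below are made up for illustration, and the sketch ignores any cache discounts or provider-specific billing rules.

```python
def evaluation_cost_usd(input_tokens, output_tokens,
                        input_price_per_m, output_price_per_m):
    """Cost in USD given token counts and prices quoted in USD per 1M tokens."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


# Hypothetical model: 20M input and 50M output tokens across all evaluations,
# priced at $2.00 per 1M input tokens and $10.00 per 1M output tokens.
print(evaluation_cost_usd(20_000_000, 50_000_000, 2.0, 10.0))
```

Because output tokens are usually priced higher than input tokens, a model that generates many tokens per answer can cost more to evaluate than its list price alone suggests.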
Speed & Latency
Comparison of first-party API performance
Output Speed
Output tokens per second · Higher is better
Tokens per second received while the model is generating tokens (i.e., after the first chunk has been received from the API, for models that support streaming).
Figures represent performance of the model's first-party API (e.g. OpenAI for o1) or the median across providers where a first-party API is not available (e.g. Meta's Llama models).
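A minimal sketch of the two rules above, under stated assumptions: output speed is taken as tokens received after the first chunk divided by the time elapsed since that chunk, and the reported figure falls back to the median across providers when no first-party figure exists. The function names, timestamps, and speeds are hypothetical, and the measurement details used by Artificial Analysis may differ.

```python
from statistics import median


def output_speed(tokens_after_first_chunk, first_chunk_time_s, last_chunk_time_s):
    """Tokens per second over the generation window that starts at the first chunk."""
    generation_time = last_chunk_time_s - first_chunk_time_s
    if generation_time <= 0:
        raise ValueError("last chunk must arrive after the first chunk")
    return tokens_after_first_chunk / generation_time


def reported_speed(first_party_speed, provider_speeds):
    """Use the first-party figure when available, otherwise the provider median."""
    if first_party_speed is not None:
        return first_party_speed
    return median(provider_speeds)


# 500 tokens streamed between t=0.5s (first chunk) and t=5.5s (last chunk).
print(output_speed(500, 0.5, 5.5))
# No first-party API: fall back to the median of three providers.
print(reported_speed(None, [80.0, 120.0, 100.0]))
```

Starting the clock at the first chunk keeps time-to-first-token (latency) out of the throughput figure, so the two can be compared separately.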
Price
Price of leading AI models based on our independent evaluations
Pricing: Cache Hit, Input, and Output
Price (USD per M Tokens)
Price per token for cached (previously processed) prompts, expressed as USD per million tokens; this is typically a significant discount on the regular input price. The values shown here are the cache hit price. Cache write and cache storage are billed separately and vary by provider (see "Cache pricing by provider" for detail).
Price per token included in the request/message sent to the API, expressed as USD per million tokens.
The blended cache price shown here uses cache hit price only. Other caching costs differ by provider:
Anthropic: charges a separate cache write fee, with different ra
[truncated for AI cost control]