Frontier Language Model Intelligence, Over Time

Artificial Analysis tracks the intelligence of leading AI models over time with its independent Intelligence Index. The index includes 10 evaluations covering reasoning, coding, knowledge, and more, helping users choose the best model for their needs.

Source: Hacker News AI · Author: doener

Artificial Analysis

Independent analysis of AI

Understand the AI landscape to choose the best model and provider for your use case

Launch

AA-AgentPerf

The first agentic inference benchmark

Launch

Coding Agent Benchmarks

Introducing the Artificial Analysis Coding Agent Benchmarks

Highlights

Intelligence

Artificial Analysis Intelligence Index · Higher is better

Speed

Output tokens per second · Higher is better

Price

USD per 1M tokens (blended) · Lower is better

Personalized model recommender

Get personalized recommendations based on your priorities for intelligence, speed, and cost

Explore agents for general work, coding, customer support, and more

Compare AI agents across capabilities, pricing, and platform support

Changelog

New language model evaluation · 11 Jun

HyperNova 60B 2605

New language model evaluation · 10 Jun

Gemma 4 12B (Non-reasoning)

New article published · 9 Jun

Claude Fable 5 Launches at #1 on the Artificial Analysis Intelligence Index

New article published · 9 Jun

Claude Fable 5: the first public Mythos-class model

New article published · 9 Jun

North Mini Code: Cohere's small coding-focused MoE model

New language model evaluation · 9 Jun

Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback)

New language model evaluation · 9 Jun

North Mini Code

New article published · 8 Jun

MiniMax-M3: Leading open weights model, once the weights are released

New language model evaluation · 8 Jun

LFM2.5-8B-A1B

New language model evaluation · 7 Jun

Gemma 4 12B (Reasoning)

New article published · 4 Jun

NVIDIA Nemotron 3 Ultra released: fast, intelligent, and open

New language model evaluation · 4 Jun

MiniCPM5-1B (Reasoning)

New language model evaluation · 4 Jun

Nemotron 3 Ultra 550B A55B (Reasoning)

New video model · 4 Jun

grok-imagine-video-1.5-preview

New article published · 3 Jun

Fun-Realtime-TTS: New Text to Speech model topping Artificial Analysis leaderboard

New language model evaluation · 3 Jun

MiniMax-M3

New article published · 2 Jun

MAI-Transcribe-1.5: New Speech to Text model leading the accuracy-speed Pareto frontier

New language model evaluation · 2 Jun

Qwen3.7 Plus

New article published · 1 Jun

AA-WER Streaming: New Speech to Text Streaming Benchmark

New language model evaluation · 1 Jun

Step 3.7 Flash

Intelligence

Intelligence of leading AI models based on our independent evaluations

Artificial Analysis Intelligence Index

Artificial Analysis Intelligence Index v4.0 incorporates 10 evaluations: GDPval-AA, 𝜏²-Bench Telecom, Terminal-Bench Hard, SciCode, AA-LCR, AA-Omniscience, IFBench, Humanity's Last Exam, GPQA Diamond, CritPt

Reasoning models are indicated by a lightbulb icon

See the Intelligence Index methodology for further details, including a breakdown of each evaluation and how we run them.

Artificial Analysis Intelligence Index by Open Weights / Proprietary

Indicates whether the model weights are available. Models are labelled as 'Commercial Use Restricted' if the weights are available but commercial use is limited (typically requires obtaining a paid license).

Intelligence vs. Cost to Run Artificial Analysis Intelligence Index

Chart: Artificial Analysis Intelligence Index plotted against the cost to run the Intelligence Index, with the most attractive quadrant highlighted.

The cost to run the evaluations in the Artificial Analysis Intelligence Index, calculated using the model's input and output token pricing and the number of tokens used across evaluations (excluding repeats).
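The cost calculation described above can be sketched as follows. This is a minimal illustration, not Artificial Analysis' actual code: the `TokenPricing` class, the `cost_to_run` function, and all token counts and prices are hypothetical, and it assumes that repeats have already been excluded from the token counts.

```python
from dataclasses import dataclass


@dataclass
class TokenPricing:
    """Hypothetical per-model pricing, in USD per 1M tokens."""
    input_usd_per_million: float
    output_usd_per_million: float


def cost_to_run(evals, pricing):
    """Sum token costs across evaluations.

    `evals` maps an evaluation name to (input_tokens, output_tokens).
    Assumption: each evaluation is counted once (repeats excluded upstream).
    """
    total_in = sum(inp for inp, _ in evals.values())
    total_out = sum(out for _, out in evals.values())
    return (
        total_in * pricing.input_usd_per_million
        + total_out * pricing.output_usd_per_million
    ) / 1_000_000


# Made-up token counts and prices, for illustration only.
example_usage = {
    "eval_a": (2_000_000, 500_000),
    "eval_b": (1_000_000, 1_500_000),
}
example_pricing = TokenPricing(input_usd_per_million=1.0, output_usd_per_million=4.0)
example_cost = cost_to_run(example_usage, example_pricing)
```

Because reasoning models tend to emit many more output tokens, two models with the same list price can differ widely on this measure.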

Frontier Language Model Intelligence, Over Time

Artificial Analysis Coding Agent Index

Performance, cost, and execution time for leading coding agents on end-to-end software engineering tasks

Artificial Analysis Coding Agent Index

Composite average pass@1 across DeepSWE, Terminal-Bench v2, and SWE-Atlas-QnA · Higher is better
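As a rough sketch of what a composite average pass@1 could look like, the snippet below takes per-task first-attempt outcomes for each benchmark and averages the three pass@1 rates. It assumes an unweighted mean on a 0 to 1 scale; the source does not state the weighting or the scale, and the function names and sample data are hypothetical.

```python
from statistics import mean

BENCHMARKS = ("DeepSWE", "Terminal-Bench v2", "SWE-Atlas-QnA")


def pass_at_1(results):
    """Fraction of tasks solved on the first attempt (list of booleans)."""
    if not results:
        raise ValueError("no task results")
    return sum(results) / len(results)


def coding_agent_index(per_benchmark):
    """Unweighted mean of pass@1 across the three benchmarks (an assumption)."""
    missing = [name for name in BENCHMARKS if name not in per_benchmark]
    if missing:
        raise ValueError(f"missing benchmarks: {missing}")
    return mean(pass_at_1(per_benchmark[name]) for name in BENCHMARKS)


# Made-up outcomes, for illustration only.
example_results = {
    "DeepSWE": [True, False],
    "Terminal-Bench v2": [True, True, False, False],
    "SWE-Atlas-QnA": [True, True, True, False],
}
example_index = coding_agent_index(example_results)
```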

Image & Video Leaderboards

Top models from our Image Arena and Video Arena leaderboards, with 95% confidence intervals

Text to Image Leaderboard

Elo scores from blind preference votes in our Image Arena.

Intelligence Breakdown

Intelligence Evaluations

Intelligence evaluations measured independently by Artificial Analysis · Higher is better

GDPval-AA · Agentic real-world work tasks, reported as (Elo - 500) / 2000

Terminal-Bench Hard · Agentic coding & terminal use

𝜏²-Bench Telecom · Agentic tool use

AA-LCR · Long context reasoning

AA-Omniscience Accuracy · Knowledge

AA-Omniscience Non-Hallucination Rate · 1 - hallucination rate

Humanity's Last Exam · Reasoning & knowledge

GPQA Diamond · Scientific reasoning

SciCode · Coding

IFBench · Instruction following

CritPt · Physics reasoning

APEX-Agents-AA · Long-horizon agentic tasks

ITBench-AA · Kubernetes incident root-cause analysis

MMMU-Pro · Visual reasoning

While model intelligence generally translates across use cases, specific evaluations may be more relevant for certain use cases.
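Two of the evaluations listed above are shown after a simple transformation: GDPval-AA as (Elo - 500) / 2000 and the AA-Omniscience Non-Hallucination Rate as 1 - hallucination rate. The helper names below are hypothetical and the inputs are made up; the arithmetic is taken directly from the labels.

```python
def normalize_gdpval_elo(elo):
    """Map a GDPval-AA Elo score onto the displayed scale: (Elo - 500) / 2000."""
    return (elo - 500) / 2000


def non_hallucination_rate(hallucination_rate):
    """Complement of the hallucination rate, so that higher is better."""
    if not 0.0 <= hallucination_rate <= 1.0:
        raise ValueError("hallucination_rate must be between 0 and 1")
    return 1 - hallucination_rate


# Illustrative inputs only.
example_gdpval = normalize_gdpval_elo(1500)
example_non_hallucination = non_hallucination_rate(0.25)
```

Both transformations put the values on a "higher is better" scale comparable to the other evaluations in the breakdown.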

AA-Omniscience

AA-Omniscience is a knowledge and hallucination benchmark that rewards accuracy, punishes bad guesses, and provides a comprehensive view of which models produce factually reliable outputs across different domains.

AA-Omniscience Index

AA-Omniscience Index (higher is better) measures knowledge reliability and hallucination. It rewards correct answers, penalizes hallucinations, and has no penalty for refusing to answer. Scores range from -100 to 100, where 0 means as many correct as incorrect answers, and negative scores mean more incorrect than correct.
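One formula consistent with this description is sketched below: correct answers add a point, incorrect answers subtract one, refusals contribute nothing, and the total is scaled to the range -100 to 100. The function name is hypothetical, and dividing by all questions (including refusals) is an assumption rather than something the description states.

```python
def omniscience_index(correct, incorrect, abstained):
    """Score in [-100, 100]: +1 per correct, -1 per incorrect, 0 per refusal.

    Assumption: the denominator is the total number of questions,
    so refusing to answer neither helps nor hurts.
    """
    total = correct + incorrect + abstained
    if total == 0:
        raise ValueError("no questions answered or abstained")
    return 100 * (correct - incorrect) / total


# Made-up counts, for illustration only.
balanced = omniscience_index(correct=50, incorrect=50, abstained=0)
cautious = omniscience_index(correct=40, incorrect=10, abstained=50)
```

Under this reading, a model that answers only when confident (`cautious`) can outscore one that answers everything and is wrong half the time (`balanced`).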

GDPval-AA

GDPval-AA evaluates AI models on real-world, economically valuable tasks across a wide range of occupations

GDPval-AA Leaderboard

Elo scores for agentic performance on real-world work tasks using web and shell access via Stirrup, an open-source harness developed by Artificial Analysis

ITBench-AA

ITBench-AA evaluates AI agents on Kubernetes incident root-cause analysis from offline incident snapshots

ITBench-AA Average precision at full recall

Average precision at full recall on Kubernetes incident root-cause analysis from offline incident snapshots · For Site Reliability Engineering (SRE) tasks

Average precision at full recall on ITBench-AA, Artificial Analysis' implementation of IBM's ITBench benchmark for Kubernetes incident root-cause analysis from offline incident snapshots.
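The page does not define the metric further. One possible reading, sketched here purely as an assumption, is that each incident yields a ranked list of candidate root causes, precision is measured at the shortest prefix of that list containing every true root cause, and the result is averaged over incidents. All names and data below are hypothetical and may not match the ITBench-AA implementation.

```python
def precision_at_full_recall(ranked_candidates, true_causes):
    """Precision at the shortest ranked prefix that covers all true causes.

    Returns 0.0 if the ranked list never reaches full recall (an assumption).
    """
    truth = set(true_causes)
    if not truth:
        raise ValueError("an incident needs at least one true root cause")
    found = set()
    for k, candidate in enumerate(ranked_candidates, start=1):
        if candidate in truth:
            found.add(candidate)
        if found == truth:
            return len(truth) / k
    return 0.0


def average_precision_at_full_recall(incidents):
    """Mean of the per-incident scores; `incidents` is a list of (ranked, truth)."""
    scores = [precision_at_full_recall(r, t) for r, t in incidents]
    return sum(scores) / len(scores)


# Made-up incidents, for illustration only.
example_incidents = [
    (["pod-a"], {"pod-a"}),           # full recall at rank 1
    (["node-x", "pod-b"], {"pod-b"}), # full recall at rank 2
]
example_score = average_precision_at_full_recall(example_incidents)
```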

Artificial Analysis Openness Index

Artificial Analysis Openness Index assesses how 'open' models are on the basis of their availability and transparency across different components.

Artificial Analysis Openness Index: Components

Openness Index underlying score contribution by components, up to a maximum of 18 (higher is more open)

Artificial Analysis Openness Index vs. Artificial Analysis Intelligence Index

Most attractive quadrant

Output Tokens

Output tokens of leading AI models based on our independent evaluations

Output Tokens Used to Run Artificial Analysis Intelligence Index

Tokens used to run all evaluations in the Artificial Analysis Intelligence Index

The number of tokens required to run all evaluations in the Artificial Analysis Intelligence Index (excluding repeats).

Cost Efficiency

Cost of leading AI models based on our independent evaluations

Cost to Run Artificial Analysis Intelligence Index

Cost (USD) to run all evaluations in the Artificial Analysis Intelligence Index

Speed & Latency

Comparison of first-party API performance

Output Speed

Output tokens per second · Higher is better

Tokens per second received while the model is generating tokens (i.e., after the first chunk has been received from the API, for models that support streaming).
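A minimal sketch of this measurement, assuming a stream is recorded as (arrival time, token count) pairs: the clock starts when the first chunk arrives, so time to first token is excluded. The function name and the choice to leave the first chunk's tokens out of the count are assumptions, not the published methodology.

```python
def output_tokens_per_second(chunks):
    """Generation speed measured from the arrival of the first chunk.

    `chunks` is a list of (arrival_time_seconds, tokens_in_chunk) in arrival order.
    Assumption: tokens in the first chunk are not counted, because they
    arrive at the start of the timing window.
    """
    if len(chunks) < 2:
        raise ValueError("need at least two chunks to measure speed")
    first_time = chunks[0][0]
    last_time = chunks[-1][0]
    elapsed = last_time - first_time
    if elapsed <= 0:
        raise ValueError("chunks must span a positive time interval")
    tokens_after_first = sum(count for _, count in chunks[1:])
    return tokens_after_first / elapsed


# Made-up stream: first chunk at 0.5 s, then 50 tokens at 1.0 s and 1.5 s.
example_stream = [(0.5, 1), (1.0, 50), (1.5, 50)]
example_speed = output_tokens_per_second(example_stream)
```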

Figures represent performance of the model's first-party API (e.g. OpenAI for o1) or the median across providers where a first-party API is not available (e.g. Meta's Llama models).
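The selection rule in this note can be sketched as: use the first-party API's figure when one exists, otherwise take the median across providers. The function name, provider names, and speeds below are hypothetical.

```python
from statistics import median


def representative_speed(speeds_by_provider, first_party=None):
    """Pick the first-party figure if available, else the median across providers.

    `speeds_by_provider` maps a provider name to output tokens per second.
    """
    if first_party is not None and first_party in speeds_by_provider:
        return speeds_by_provider[first_party]
    return median(speeds_by_provider.values())


# Made-up figures, for illustration only.
with_first_party = representative_speed(
    {"OpenAI": 120.0, "other-host": 80.0}, first_party="OpenAI"
)
without_first_party = representative_speed(
    {"host-a": 50.0, "host-b": 70.0, "host-c": 90.0}
)
```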

Price

Price of leading AI models based on our independent evaluations

Pricing: Cache Hit, Input, and Output

Price (USD per M Tokens)

Cache hit price: price per token for cached (previously processed) prompts, typically offered at a significant discount to the regular input price, represented as USD per million tokens. The values shown here are the cache hit price; cache write and cache storage are billed separately and vary by provider (see "Cache pricing by provider" for detail).

Input price: price per token included in the request/message sent to the API, represented as USD per million tokens.
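To illustrate how the cache hit price and the regular input price combine for a single request, the sketch below bills cached tokens at the cache hit price and the rest at the input price. The function name and all numbers are hypothetical, and cache write and cache storage fees are deliberately left out because they are billed separately and vary by provider.

```python
def input_cost_usd(input_tokens, cached_tokens, input_price, cache_hit_price):
    """Input-side cost of one request, with prices in USD per 1M tokens.

    Assumption: only cache hits are discounted; cache write and storage
    fees are not modelled here.
    """
    if not 0 <= cached_tokens <= input_tokens:
        raise ValueError("cached_tokens must be between 0 and input_tokens")
    uncached_tokens = input_tokens - cached_tokens
    return (
        uncached_tokens * input_price + cached_tokens * cache_hit_price
    ) / 1_000_000


# Made-up request: 1M input tokens, 800k of them served from cache.
example_input_cost = input_cost_usd(
    input_tokens=1_000_000,
    cached_tokens=800_000,
    input_price=2.0,
    cache_hit_price=0.5,
)
```

With these illustrative prices the request costs 0.80 USD on the input side, compared with 2.00 USD with no cache hits.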

The blended cache price shown here uses cache hit price only. Other caching costs differ by provider:

Anthropic: charges a separate cache write fee, with different ra

[truncated for AI cost control]