WorldVQA: Measuring Atomic World Knowledge in MLLMs

WorldVQA is a new benchmark for evaluating the factual correctness of MLLMs on visual world knowledge. It includes 3,500 high-quality image-question pairs across 9 categories, split into head (common) and tail (long-tail) knowledge. Frontier models score below 50% accuracy and show marked overconfidence, revealing gaps in visual knowledge.

Source: Kimi Blog

Introducing WorldVQA

A benchmark for evaluating atomic visual world knowledge in Multimodal LLMs.

Authors Kimi Team

Overview

We are releasing WorldVQA, a new benchmark designed to measure the factual correctness of Multimodal Large Language Models (MLLMs). While recent models have demonstrated impressive capabilities in visual reasoning and description, measuring their reliability regarding visual world knowledge remains a challenge.

WorldVQA focuses on a critical question: Does the model actually recognize the specific entity it sees, or is it merely hallucinating based on visual patterns?

Our results show that WorldVQA poses a significant challenge for frontier models. Even state-of-the-art models struggle with long-tail visual knowledge: in the overall results below, every evaluated model scores under 50% accuracy, with the best reaching 47.4%. This benchmark aims to drive progress toward more factually reliable and knowledgeable multimodal AI.

The Dataset

The dataset consists of 3,500 high-quality image-question pairs. Its distribution is designed to test the encyclopedic breadth of a model's world knowledge. The dataset distinguishes itself through three core design principles:

Factuality & Unambiguity: Every question has a single, verifiable ground-truth answer. We exclude subjective questions or ambiguous visual scenarios.

Rich Taxonomy: The dataset spans 9 categories to ensure broad coverage of world knowledge.

Head vs. Tail Distribution: We explicitly separate data into Head (common knowledge) and Tail (rare/long-tail knowledge). This allows us to measure how model performance degrades as knowledge becomes more obscure.
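The head/tail comparison described above can be sketched as a per-split accuracy calculation. This is a minimal illustration, not the released evaluation script: the record keys `split` and `correct` are hypothetical names, and the toy grades are made up for the example.

```python
from collections import defaultdict


def accuracy_by_split(records):
    """Return {split: accuracy in %} for a list of graded records.

    Each record is a dict with hypothetical keys:
      "split"   -> "head" or "tail"
      "correct" -> bool, True if the answer matched the ground truth
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for rec in records:
        totals[rec["split"]] += 1
        hits[rec["split"]] += int(rec["correct"])
    return {split: 100.0 * hits[split] / totals[split] for split in totals}


# Made-up toy grades, for illustration only (not WorldVQA results).
toy = [
    {"split": "head", "correct": True},
    {"split": "head", "correct": True},
    {"split": "head", "correct": False},
    {"split": "head", "correct": True},
    {"split": "tail", "correct": True},
    {"split": "tail", "correct": False},
    {"split": "tail", "correct": False},
    {"split": "tail", "correct": False},
]

scores = accuracy_by_split(toy)
gap = scores["head"] - scores["tail"]  # how much accuracy drops on the tail
print(scores, gap)
```

A large positive gap would indicate that performance degrades as knowledge becomes more obscure, which is the effect the head/tail split is meant to expose.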

Note on Quality: To ensure the benchmark is a reliable gold standard, all images and question-answer pairs underwent rigorous multi-stage human verification to filter out noise and ambiguity.


Nature & Environment

What bird is in the picture?

Answer: Chestnut Shortwing

Nature & Environment

What's the name of the flower in the picture?

Answer: Freesia

Locations & Architecture

Which heritage site does the content or artifact shown in the image belong to?

Answer: Shanhua Temple (善化寺)

Locations & Architecture

What is the name of the natural landmark shown in the image?

Answer: Cape of Good Hope

Culture, Arts & Crafts

What is the title of the dance performance shown in the picture?

Answer: Swan Lake

Culture, Arts & Crafts

What treasure is shown in this picture?

Answer: Warring States crystal cup (战国水晶杯)

Objects & Products

What style of bag is shown in the picture?

Answer: Shell bag

Objects & Products

What electronic consumer product is shown in the image? Provide the exact name and model number.

Answer: iPhone 17 Pro

Vehicles, Craft & Transportation

What model is the aircraft in the image?

Answer: Chinese J-20 fighter jet (中国歼-20战斗机)

Vehicles, Craft & Transportation

What specific attachment or accessory is this for the vehicle?

Answer: Roll cage

Entertainment, Media & Gaming

What is the name of the character in the picture?

Answer: Bayle the Dread

Entertainment, Media & Gaming

Which film or TV series is this image from?

Answer: Your Name

Brands, Logos & Graphic Design

What is the medium (carrier) of the advertisement in this image?

Answer: Direct-mail advertisement

Brands, Logos & Graphic Design

What is the name of the trademark or logo shown in the image?

Answer: EgyptAir

Sports, Gear & Venues

What track-and-field or gymnastics event is shown in the picture? Please be as specific as possible.

Answer: Floor exercise

Sports, Gear & Venues

Which sports venue is the building in the picture?

Answer: Shanghai Stadium (上海体育场)

Distribution of Tasks per Category

Statistic                                            Number

Data
  Total                                              3500
  Chinese (CN)                                       1260 (36%)
  English (EN)                                       2240 (64%)

Categories
  Total categories                                   9
  Nature & Environment (Nature)                      9.31%
  Locations & Architecture (Geography)               14.63%
  Culture, Arts & Crafts (Culture)                   14.46%
  Objects & Products (Objects)                       12.49%
  Vehicles, Craft & Transportation (Transportation)  8.74%
  Entertainment, Media & Gaming (Entertainment)      14.60%
  Brands, Logos & Graphic Design (Brands)            7.43%
  Sports, Gear & Venues (Sports)                     4.06%
  Notable People & Public Figures (People)           14.29%

Difficulty
  Easy                                               31.16%
  Medium                                             40.77%
  Hard                                               28.07%
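To make the percentages above more concrete, the following sketch converts them into approximate item counts out of the 3,500 total. The counts are derived by rounding and are an estimate, not official per-category figures; the category shares as published sum to 100.01% because of rounding.

```python
TOTAL = 3500  # total image-question pairs reported in the statistics

# Shares (in %) copied from the statistics above.
category_share = {
    "Nature": 9.31,
    "Geography": 14.63,
    "Culture": 14.46,
    "Objects": 12.49,
    "Transportation": 8.74,
    "Entertainment": 14.60,
    "Brands": 7.43,
    "Sports": 4.06,
    "People": 14.29,
}
difficulty_share = {"Easy": 31.16, "Medium": 40.77, "Hard": 28.07}


def approx_counts(shares, total=TOTAL):
    """Estimate item counts by rounding total * share to the nearest integer."""
    return {name: round(total * pct / 100) for name, pct in shares.items()}


category_counts = approx_counts(category_share)
difficulty_counts = approx_counts(difficulty_share)

for name, count in category_counts.items():
    print(f"{name:<15} ~{count}")
print(difficulty_counts)
```

Under this rounding, both the category and the difficulty estimates add up to exactly 3,500, which suggests the published percentages are consistent with whole-item counts.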

Using WorldVQA to compare models

Overall results (%)

Model                        Accuracy  Not Attempted  Correct Given Attempted  F-score
Kimi K2.5                    46.3      2.1            47.3                     46.8
Gemini-3-pro                 47.4      0.6            47.7                     47.5
Gemini-2.5-pro               36.9      0.1            36.9                     36.9
Seed-1.5-vision-pro          34.9      1.6            35.5                     35.2
Claude-opus-4.5              36.8      3.4            38.1                     37.5
Claude-sonnet-4.5            20.0      8.0            21.8                     20.9
GPT-5.2                      28.0      5.4            29.5                     28.7
GPT-5.1                      24.5      16.3           29.3                     26.7
GPT-4o                       22.2      9.1            24.4                     23.3
Grok-4.1-fast-reasoning      21.1      0.1            21.1                     21.1
Grok-4-fast-reasoning        18.9      0.2            19.0                     18.9
Kimi-VL-16B-A3B              12.0      3.3            12.4                     12.2
Qwen3-VL-235B-A22B-Instruct  23.5      0.0            23.5                     23.5
Qwen3-VL-32B-Instruct        17.7      0.0            17.7                     17.7
GLM-4.6V                     19.0      0.0            19.0                     19.0
GLM-4.6V-Flash               14.8      0.1            14.8                     14.8

F-score (%) on 9 task categories (n/a: no value reported)

Model                        Nature  Geography  Culture  Objects  Transportation  Entertainment  Brands  Sports  People
Kimi K2.5                    40.6    46.8       43.0     44.7     47.4            48.1           52.6    64.8    50.9
Gemini-3-pro                 45.1    44.7       47.2     48.1     45.1            47.6           52.4    59.4    n/a
Gemini-2.5-pro               37.1    33.8       32.6     39.6     39.9            34.2           38.8    54.2    n/a
Seed-1.5-vision-pro          41.4    36.1       33.4     32.8     35.0            33.6           32.3    43.7    n/a
Claude-opus-4.5              32.5    36.5       34.1     39.6     43.5            29.0           47.6    54.9    n/a
Claude-sonnet-4.5            19.4    21.0       17.4     22.9     24.8            11.6           32.2    31.0    n/a
GPT-5.2                      24.3    29.1       26.7     26.6     30.7            24.8           39.1    40.8    n/a
GPT-5.1                      27.3    25.1       22.5     26.6     31.6            18.5           36.0    45.4    n/a
GPT-4o                       25.6    20.6       17.8     19.1     26.2            19.1           35.2    44.5    n/a
Grok-4.1-fast-reasoning      18.4    23.6       20.2     25.2     23.5            11.4           25.8    30.3    n/a
Grok-4-fast-reasoning        17.8    19.0       18.6     22.0     20.3            8.3            26.6    34.5    n/a
Kimi-VL-16B-A3B              11.2    13.9       10.1     10.8     13.5            7.9            20.8    17.7    7.4
Qwen3-VL-235B-A22B-Instruct  26.1    24.8       22.9     26.1     28.8            15.5           22.3    26.1    26.2
Qwen3-VL-32B-Instruct        18.1    18.0       16.8     19.0     19.0            12.1           23.8    20.4    13.1
GLM-4.6V                     24.5    21.5       17.8     19.2     18.6            12.5           20.4    23.2    10.7
GLM-4.6V-Flash               16.0    16.3       13.2     14.9     19.0            7.8            18.8    20.4    8.2
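The post does not define how Correct Given Attempted and F-score are derived. The sketch below assumes the convention used in SimpleQA-style grading: Correct Given Attempted is accuracy divided by the attempted share, and F-score is the harmonic mean of accuracy and Correct Given Attempted. These definitions are an assumption, and the function names are illustrative; for the rows checked here they reproduce the reported values to within rounding of the published inputs.

```python
def correct_given_attempted(accuracy, not_attempted):
    """Accuracy restricted to attempted questions; inputs and output in %."""
    attempted = 100.0 - not_attempted
    if attempted <= 0:
        raise ValueError("no questions were attempted")
    return 100.0 * accuracy / attempted


def f_score(accuracy, not_attempted):
    """Harmonic mean of overall accuracy and correct-given-attempted (in %)."""
    cga = correct_given_attempted(accuracy, not_attempted)
    if accuracy + cga == 0:
        return 0.0
    return 2.0 * accuracy * cga / (accuracy + cga)


# (Accuracy, Not Attempted) pairs copied from the overall results.
rows = {
    "Kimi K2.5": (46.3, 2.1),
    "Gemini-3-pro": (47.4, 0.6),
    "GPT-5.1": (24.5, 16.3),
}

for model, (acc, na) in rows.items():
    cga = correct_given_attempted(acc, na)
    print(f"{model:<14} CGA={cga:.1f}  F={f_score(acc, na):.1f}")
```

Because the published inputs are already rounded to one decimal, recomputed values can differ from the table by about 0.1 for some models. The metric rewards abstaining over guessing wrongly: GPT-5.1 declines 16.3% of questions, so its Correct Given Attempted (29.3) sits well above its raw accuracy (24.5).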

Measuring Calibration: Confidence vs. Accuracy

In our experiments comparing model confidence with actual accuracy, we used two metrics to measure the alignment between a model's subjective belief and its objective performance:

ECE (Expected Calibration Error): Measures the average gap between the model's subjective confidence and its objective accuracy. The ideal value is 0.

Slope (Weighted Average Slope): Measures how strongly the model's actual accuracy rises with its stated confidence. The ideal value is 1.0, meaning accuracy increases one-for-one with confidence.
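As a rough illustration of the first metric, here is a minimal sketch of the standard binned ECE computation. The post does not state the binning scheme, so the 10 equal-width bins are an assumption, and the toy inputs are made up. The Slope metric is not sketched because the post does not give its formula.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: sample-weighted mean of |accuracy - mean confidence| per bin.

    confidences: stated confidences in [0, 1]
    correct:     booleans, True where the answer was graded correct
    """
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")

    counts = [0] * n_bins
    conf_sums = [0.0] * n_bins
    hit_sums = [0] * n_bins
    for conf, ok in zip(confidences, correct):
        # Equal-width bins; a confidence of exactly 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        counts[idx] += 1
        conf_sums[idx] += conf
        hit_sums[idx] += int(ok)

    total = len(confidences)
    ece = 0.0
    for n, conf_sum, hits in zip(counts, conf_sums, hit_sums):
        if n == 0:
            continue  # empty bins contribute nothing
        ece += (n / total) * abs(hits / n - conf_sum / n)
    return ece


# Made-up toy data: the model states 95% confidence but is right half the time.
overconfident = expected_calibration_error([0.95] * 4, [True, True, False, False])
print(f"ECE = {overconfident:.2f}")  # prints ECE = 0.45
```

A perfectly calibrated model, whose accuracy in every bin equals its mean stated confidence, would score 0 under this definition.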

Calibration and Confidence Distribution Analysis. Left: Reliability diagrams plotting Actual Accuracy against Stated Confidence. To ensure statistical significance, only bins containing more than 20 samples are visualized. The size of each data point is proportional to the number of samples in that bin. The black dashed diagonal (y=x) represents perfect calibration, while colored dashed lines indicate the weighted average slope for each model. Right: The distribution of stated confidence scores across the full dataset (without sample thresholding). The plots reveal a severe overconfidence trend, with most models concentrating their predictions in the 90-100% confidence range.

Our experiments reveal that all evaluated models are currently far from the ideal state, exhibiting a universal tendency toward overconfidence.

Kimi-K2.5 achieves the best performance on both metrics, with an ECE of 37.9% and a Slope of 0.550, but a significant gap remains in the pursuit of "honesty" and "alignment." Enhancing the self-awareness boundaries of multimodal models is a critical direction for future exploration.

Conclusion

WorldVQA is a simple but challenging benchmark for evaluating the atomic visual knowledge of frontier models. Improving performance on WorldVQA is a necessary step for the next generation of AI agents. We are open-sourcing the WorldVQA dataset and evaluation scripts to help the community address the visual knowledge gap.

Read the Paper: https://arxiv.org/abs/2602.02537

View the Code: https://github.com/MoonshotAI/WorldVQA

Download the Data: https://huggingface.co/datasets/moonshotai/WorldVQA
