Evaluating agents for scientific discovery
Two benchmarks developed at Ai2 – ScienceWorld and DiscoveryWorld – reveal that even the strongest AI science agents struggle with problems human scientists solve routinely. ScienceWorld tests basic experiment execution, while DiscoveryWorld evaluates end-to-end scientific discovery. Current top models score ~80% on ScienceWorld but only ~20% on the harder DiscoveryWorld tasks, which human scientists solve ~70% of the time.
April 13, 2026
Ai2
Everyone's building AI science agents. But how do you know if they actually work?
Open any social media feed and you'll find teams announcing agents that design experiments, write code, and produce entire research papers. The claims are extraordinary. The evidence behind them, usually, is not. That's why we've spent years building benchmarks that test whether AI agents can actually do science. Two developed at Ai2 – ScienceWorld, released in 2022, and DiscoveryWorld, released in 2024 – have taken on new significance, as the capabilities of today's models have caught up to the challenges we designed the benchmarks to measure.
In 2022, the best AI models scored highly on multiple-choice grade-school science exams. But when those same models were asked to demonstrate that knowledge by performing experiments in a simple virtual environment, they scored below 10%, highlighting the difference between "book smarts" and "street smarts."
That virtual environment was ScienceWorld, our benchmark that tests whether agents can carry out elementary-school science experiments. Three years later, top frontier models (as of early 2025) score in the low 80s—real progress, but still short of fully solving a 4th-grade science curriculum. And on DiscoveryWorld, our harder benchmark that asks agents to design and execute their own scientific investigations, some of the best systems complete only ~20% of tasks at higher difficulty—problems that average human scientists with advanced degrees solve ~70% of the time.
Some data derived from: Cui, C. Z., Yuan, X., Xiao, Z., Ammanabrolu, P., & Côté, M. A. (2025). "TALES: Text Adventure Learning Environment Suite." arXiv. https://arxiv.org/abs/2504.14128
"So many folks are jumping on the science agent bandwagon and releasing agents," says Ai2 Researcher Peter Jansen, who led development of ScienceWorld and DiscoveryWorld and has built much of the modern infrastructure enabling language models to be evaluated on text-based games. "But if the best systems a year ago couldn't even solve most of the easy problems in DiscoveryWorld, how likely is it that they're much better today?"
End-to-end scientific discovery, in simulation
DiscoveryWorld, released in 2024, is the first benchmark built to test whether an agent can design and execute end-to-end scientific investigations from scratch. DiscoveryWorld takes place on Planet X, a hypothetical space colony in the not-so-distant future, and the player takes the role of one of the scientists on Planet X.
DiscoveryWorld contains 120 challenge tasks spanning eight topics – from proteomics and rocket science to radioisotope dating and epidemiology – across three difficulty levels, with parametric variations that change the data, solution, and environment layout each run. Tasks are set in fictional scientific contexts so agents can't fall back on prior knowledge: in one, an agent has to determine the cause of an illness outbreak; in another, it has to uncover the mathematical relationship governing a quantum reactor. Each requires forming hypotheses, designing experiments, running them, and analyzing results — often over hundreds of in-game actions. DiscoveryWorld scores not just whether the agent solved the task, but whether it followed a scientific process and whether it actually understood the discovery it made, distinguishing genuine insight from lucky guessing.
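To make the three-part scoring concrete, here is a minimal sketch of how completion, process, and understanding could be reported side by side. It is not DiscoveryWorld's actual scoring code: the `EpisodeResult` fields, the `summarize` function, and the example numbers are all hypothetical, chosen only to show why a lucky guess and a genuine discovery look different once process and knowledge are scored separately.

```python
from dataclasses import dataclass


@dataclass
class EpisodeResult:
    """One agent run on one task (hypothetical structure, for illustration)."""
    completed: bool         # did the agent solve the task?
    milestones_hit: int     # steps of the scientific process it actually performed
    milestones_total: int
    knowledge_correct: int  # questions about the discovery it answered correctly
    knowledge_total: int


def summarize(results):
    """Average each of the three scores separately instead of merging them."""
    n = len(results)
    return {
        "completion": sum(r.completed for r in results) / n,
        "process": sum(r.milestones_hit / r.milestones_total for r in results) / n,
        "knowledge": sum(r.knowledge_correct / r.knowledge_total for r in results) / n,
    }


# Two made-up runs that both "solve" the task.
lucky_guess = EpisodeResult(True, milestones_hit=2, milestones_total=4,
                            knowledge_correct=0, knowledge_total=2)
real_discovery = EpisodeResult(True, milestones_hit=4, milestones_total=4,
                               knowledge_correct=2, knowledge_total=2)

summary = summarize([lucky_guess, real_discovery])
print(summary)
```

Both runs count toward completion, but only one of them contributes fully to the process and knowledge scores, which is the distinction the benchmark is designed to surface.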
As Jansen describes it, these evaluations measure the difference between "book smarts" (like answering exam questions) and "street smarts," i.e., using the scientific method to make new discoveries. While practicing human scientists find all of the discoveries in DiscoveryWorld, recent leading agents can't complete roughly 80% of DiscoveryWorld's tasks at normal and challenge difficulty. Knowing what a concept is and being able to apply it are different things entirely.
Despite this difficulty – or perhaps because of it – DiscoveryWorld has drawn wide interest. The paper has been cited nearly 80 times and covered by New Scientist.
"We tend to release benchmarks that start out being very challenging, but they become much more popular a year or two later as models and methods catch up," Jansen says. "ScienceWorld was very much like that, and DiscoveryWorld seems like it's getting like that now. In fact, with models at their current price-to-performance ratio, I'd argue there's never been a better time to test whether your agent can solve long-horizon scientific discovery tasks with DiscoveryWorld."
Executing experiments at the elementary level
ScienceWorld is a more foundational benchmark. Where DiscoveryWorld tests open-ended discovery at a college or PhD level – designing novel investigations and interpreting ambiguous results – ScienceWorld asks whether an agent can "re-make" classic scientific discoveries at roughly an elementary-school level: the kinds of experiments found in today's science textbooks.
ScienceWorld places agents inside a text-based simulated world spanning ten interconnected locations – a kitchen, a workshop, a greenhouse, and others – populated with around 200 types of objects that behave as they would in a real lab: ice melts when heated, circuits conduct based on the materials used, and plants grow under the right conditions. Instead of picking the boiling point of water from a list of multiple-choice answers, an agent might be given an unknown substance, a thermometer, and a stove, and asked to figure out the boiling point itself. Agents issue text commands and receive descriptions of what happens next, working through 30 task types across categories like changing states of matter, mixing chemicals, and running Mendelian genetics crosses. Each of the 30 tasks has hundreds of randomized configurations, so an agent can't succeed by memorizing solutions—it has to generalize.
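The command-and-observation loop and the role of randomized configurations can be illustrated with a toy example. The sketch below does not use the ScienceWorld API: `ToyBoilingEnv`, its three commands, and both agents are hypothetical stand-ins, written only to show why an agent that measures can generalize across configurations while one that recites a memorized answer cannot.

```python
import random


class ToyBoilingEnv:
    """Hypothetical text environment (not ScienceWorld): find a boiling point."""

    def __init__(self, seed: int):
        # Each seed is one randomized configuration with a different hidden answer.
        self.boiling_point = random.Random(seed).randrange(60, 140, 5)
        self.temperature = 20

    def step(self, command: str) -> tuple[str, float, bool]:
        """Apply a text command and return (observation, reward, done)."""
        if command == "heat substance":
            self.temperature = min(self.temperature + 5, self.boiling_point)
            return "You turn up the stove.", 0.0, False
        if command == "read thermometer":
            state = "boiling" if self.temperature >= self.boiling_point else "liquid"
            obs = f"The thermometer reads {self.temperature} C. The substance is {state}."
            return obs, 0.0, False
        if command.startswith("answer "):
            correct = command.split()[1] == str(self.boiling_point)
            return ("Correct." if correct else "Incorrect."), float(correct), True
        return "Nothing happens.", 0.0, False


def measuring_agent(env: ToyBoilingEnv, max_steps: int = 200) -> float:
    """Heat, read the thermometer, repeat until boiling, then report the reading."""
    for _ in range(max_steps):
        env.step("heat substance")
        obs, _, _ = env.step("read thermometer")
        if "boiling" in obs:
            reading = obs.split()[3]  # the number after "The thermometer reads"
            _, reward, _ = env.step(f"answer {reading}")
            return reward
    return 0.0


def memorizing_agent(env: ToyBoilingEnv, memorized: str = "100") -> float:
    """Skip the experiment and recite a fixed answer."""
    _, reward, _ = env.step(f"answer {memorized}")
    return reward


seeds = range(20)
measured = sum(measuring_agent(ToyBoilingEnv(s)) for s in seeds) / len(seeds)
memorized = sum(memorizing_agent(ToyBoilingEnv(s)) for s in seeds) / len(seeds)
print(f"measuring agent: {measured:.2f}, memorizing agent: {memorized:.2f}")
```

Because the hidden boiling point changes with every seed, the measuring agent succeeds on each configuration, while the memorizing agent is right only when its fixed answer happens to match.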
ScienceWorld places agents in an interactive environment where they're tasked with executing experiments.
At this level, the gap between knowing and doing is still wide. When ScienceWorld launched, the same models that received an 'A' grade on the ARC science exam – a standard benchmark for scientific knowledge – failed more than 90% of ScienceWorld, despite both covering the same conceptual material. Knowing what a melting point is turns out to be a far cry from figuring out how to measure one.
Scores have climbed since then. TALES, a 2025 benchmark suite from Microsoft Research that includes ScienceWorld, found that leading models scored in the low 80s—a dramatic improvement from sub-10% three years earlier, but still short of fully solving the tasks.
"We hope that in the near future, science agents will help treat diseases, create new materials, and generate other important discoveries," Jansen says. "DiscoveryWorld and ScienceWorld help measure whether agents can begin that process by testing their end-to-end scientific capabilities in simplified virtual worlds. If an agent flunks basic science, what hope does it have of curing cancer?"
Benchmarks like DiscoveryWorld and ScienceWorld help test what science agents are actually capable of—and we're building them alongside systems that push the boundaries of what's possible, because making progress and measuring it are two sides of the same effort. DiscoveryWorld and ScienceWorld are open and freely available, with the goal of helping turn promising ideas into proven results.