AI News Hub (live)
Public articles: 14 · Collected articles: 16 · Trust: 90 · Refresh: 30 min
Health: Auto-paused · Source type: Research · Full-text rights: Official full text · Last ingested: 2026-05-23 · ID: ai2-blog · Status: Not enabled

Official Allen Institute for AI research feed; verify terms before displaying full body.

Latest public articles

Introducing AIMIP: The AI weather and climate model intercomparison project

AIMIP is a new open benchmark and dataset for evaluating AI climate models. It shows that these models can match or beat conventional models on some historical climate metrics, while still struggling to generalize reliably to long-term warming trends and unseen climate scenarios.

  • AIMIP provides a shared benchmark and dataset for comparing AI climate models.
  • AI climate models show competitive performance on average historical climate patterns.
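The AIMIP summary mentions scoring AI climate models on average historical climate patterns. The exact AIMIP metrics are not given here, so the following is only a minimal sketch of one common choice, assumed for illustration: a latitude-weighted RMSE between a model's time-mean field and a reference climatology on a regular latitude-longitude grid. The function name and grid layout are hypothetical.

```python
import math
from typing import List, Sequence


def lat_weighted_rmse(pred: List[List[float]], ref: List[List[float]],
                      lats_deg: Sequence[float]) -> float:
    """Latitude-weighted RMSE between two lat x lon fields (hypothetical helper).

    pred and ref hold one row per latitude, each row a list of values over
    longitude; lats_deg gives the latitude (degrees) of each row. Each row is
    weighted by cos(latitude), approximating the area of its grid cells.
    """
    if not (len(pred) == len(ref) == len(lats_deg)):
        raise ValueError("pred, ref and lats_deg must have the same number of rows")
    num = 0.0  # weighted sum of squared errors
    den = 0.0  # sum of weights over all grid cells
    for lat, prow, rrow in zip(lats_deg, pred, ref):
        w = math.cos(math.radians(lat))
        for p, r in zip(prow, rrow):
            num += w * (p - r) ** 2
            den += w
    return math.sqrt(num / den)
```

Weighting by cos(latitude) keeps the many small grid cells near the poles from dominating the score.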

Why Artificial Analysis uses Ai2's IFBench instruction-following eval

Artificial Analysis employs Ai2's open IFBench evaluation because it captures a stubborn, real-world capability many benchmarks miss: whether models can reliably follow complex, multi-part user instructions.

  • IFBench tests a model's ability to follow multiple constraints simultaneously, reflecting real user queries.
  • The benchmark is built from real user conversations and covers diverse tasks, making it more practical than traditional instruction-following evals.
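IFBench asks whether a model can satisfy several constraints in one response. As a hypothetical sketch (the constraint types and function names below are illustrative, not IFBench's actual checkers), multi-constraint scoring can be written as a list of verifiable checks, where a response counts as fully compliant only if every check passes.

```python
from typing import Callable, List, Tuple

# A constraint is a (description, checker) pair; the checker returns True if
# the response satisfies it. These constraint builders are hypothetical examples.
Constraint = Tuple[str, Callable[[str], bool]]


def max_words(n: int) -> Constraint:
    return (f"at most {n} words", lambda text: len(text.split()) <= n)


def must_include(keyword: str) -> Constraint:
    return (f"mentions '{keyword}'", lambda text: keyword.lower() in text.lower())


def ends_with(suffix: str) -> Constraint:
    return (f"ends with '{suffix}'", lambda text: text.rstrip().endswith(suffix))


def score(response: str, constraints: List[Constraint]) -> Tuple[bool, float]:
    """Return (all_pass, fraction_passed) for one response."""
    results = [check(response) for _, check in constraints]
    return all(results), sum(results) / len(results)
```

Reporting both numbers separates "followed every instruction" from "followed most of them", which is where multi-part prompts tend to expose failures.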

EMO: Pretraining mixture of experts for emergent modularity

EMO is a new mixture-of-experts model trained so that modular expert groups emerge from the data, letting users select small task-specific subsets of experts while preserving near-full-model performance.

  • EMO uses document-level routing constraints to make experts specialize in semantic domains.
  • Using only 12.5% of its experts, EMO retains near-full-model performance, whereas a standard MoE degrades severely.
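The EMO summary describes running with a small subset of experts. How EMO selects and applies that subset is not described here; as an assumption, a minimal sketch of restricting a standard top-k router to an allowed subset (for example 8 of 64 experts, i.e. 12.5%) looks like this. The function name and expert counts are hypothetical.

```python
import math
from typing import List, Sequence, Set, Tuple


def route_token(logits: Sequence[float], allowed: Set[int], k: int) -> List[Tuple[int, float]]:
    """Top-k routing restricted to an allowed subset of experts (hypothetical).

    Experts outside `allowed` are never considered (equivalent to masking their
    logits to -inf). The k highest-scoring allowed experts receive weights from
    a softmax over their logits, so the weights sum to 1.
    """
    candidates = [i for i in range(len(logits)) if i in allowed]
    if len(candidates) < k:
        raise ValueError("need at least k allowed experts")
    top = sorted(candidates, key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)  # subtract the max for numerical stability
    exps = [math.exp(logits[i] - m) for i in top]
    z = sum(exps)
    return [(i, e / z) for i, e in zip(top, exps)]
```

For example, `route_token([float(i) for i in range(64)], set(range(8)), 2)` picks experts 7 and 6 even though experts 8 to 63 have higher logits, because they lie outside the allowed subset.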

Open by design: Ai2 brings fully open AI infrastructure online with NSF OMAI

Ai2 is bringing NSF OMAI compute online to power a fully open AI research ecosystem, turning national infrastructure investment into reusable models, data, methods, and tools that can accelerate scientific discovery.

  • Ai2 received $152 million from NSF and NVIDIA to build NSF OMAI, now operational with NVIDIA Blackwell Ultra.
  • The infrastructure emphasizes openness and reusability, maximizing the impact of every GPU hour.

MolmoAct 2: An open foundation for robots that work in the real world

MolmoAct 2 is a fully open robotics foundation model with faster 3D action reasoning, a new bimanual dataset, and strong zero-shot performance on real-world tasks.

  • MolmoAct 2 outperforms proprietary models on industry benchmarks and runs up to 37x faster than its predecessor.
  • The release includes the largest open-source bimanual manipulation dataset, with over 720 hours of demonstrations.

What’s next for Ai2: A conversation with Interim CEO Peter Clark

Interim CEO Peter Clark discusses Ai2's ongoing commitment to open science amidst rapid AI progress, highlighting key projects, the NSF OMAI initiative, and future directions in AI for science, embodied AI, and environmental AI.

  • Ai2 remains committed to open science in a fast-paced AI landscape.
  • Projects like OLMo, Molmo, and AutoDiscovery exemplify open frontier models and real-world impact.

AstaBench update: New results, plus adoption from industry

AstaBench’s latest update adds new frontier-model results, including GPT-5.5, and highlights growing adoption from groups including the UK AISI, General Reasoning, Elicit, SciSpace, Distyl AI, and EvoScientist.

  • The update tests frontier models, including GPT-5.5 and Claude Opus 4.7, on more than 2,400 research problems.
  • Claude Opus 4.7 leads with 58.0% overall but is the most expensive; GPT-5.5 scores 52.9% at lower cost, the best result among non-Claude models.

Molmo learns to point and act

Ai2 releases MolmoPoint and MolmoWeb, extending the Molmo family from visual understanding to visual action. MolmoPoint achieves state-of-the-art pointing by selecting directly from input data, while MolmoWeb is a vision-based web agent that navigates websites via screenshots and mouse/keyboard actions, outperforming many open and closed models. Both are open-source.

  • MolmoPoint improves pointing by selecting directly from the input rather than generating coordinates, boosting accuracy and efficiency; it sets state-of-the-art results on multiple benchmarks.
  • MolmoWeb is a visual web agent that operates on screenshots alone, surpassing larger proprietary models like GPT-4o-based agents on web tasks.
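The MolmoPoint summary contrasts selecting from the input with generating coordinates. As a hypothetical sketch of the "selection" idea only (MolmoPoint's actual scoring and refinement are not described here; the grid layout and patch size are assumptions), a model can score every image patch and return the centre of the highest-scoring one as the point.

```python
from typing import Sequence, Tuple


def point_from_patch_scores(scores: Sequence[float], grid_w: int,
                            patch_px: int) -> Tuple[float, float]:
    """Select the best patch and return its centre in pixels (hypothetical).

    scores is a flattened, row-major grid of per-patch scores with grid_w
    patches per row; each patch covers patch_px x patch_px pixels.
    """
    best = max(range(len(scores)), key=lambda i: scores[i])
    row, col = divmod(best, grid_w)
    return ((col + 0.5) * patch_px, (row + 0.5) * patch_px)
```

Because the output is an index into patches the model has already seen, it cannot name a location outside the image, which is one intuition for why selection can be more reliable than free-form coordinate text.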

OlmPool: How small architectural choices compound to undermine long context extension

OlmPool is a controlled suite of 26 models showing how small architecture choices can compound to make long-context extension much harder, even when training data and extension recipes are held constant.

  • Four architectural choices (QK normalization, GQA, sliding window attention, and pretraining context length) each have a modest individual effect but compound to cause up to a 47% drop in long-context performance.
  • Standard training metrics fail to predict long-context performance: models that appear nearly identical can diverge by more than 26 points after extension.
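Of the four choices listed for OlmPool, sliding window attention is the easiest to show directly. The sketch below is a generic illustration, not OlmPool's configuration (the window size is arbitrary): it builds the boolean attention mask for plain causal attention and for a sliding window, where each query sees only the most recent `window` positions, so information from distant tokens can reach it only indirectly through stacked layers.

```python
from typing import List, Optional


def attention_mask(seq_len: int, window: Optional[int] = None) -> List[List[bool]]:
    """mask[q][k] is True if query position q may attend to key position k.

    Causal attention lets q see every k <= q. With a sliding window, q sees
    only keys with q - window < k <= q.
    """
    mask = []
    for q in range(seq_len):
        lo = 0 if window is None else max(0, q - window + 1)
        mask.append([lo <= k <= q for k in range(seq_len)])
    return mask
```

With `seq_len=6`, the last query attends to all 6 positions under plain causal attention but only to positions 3, 4, and 5 with `window=3`.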

Introducing OlmoEarth embeddings: Custom embedding exports from OlmoEarth Studio for downstream analysis

OlmoEarth Studio now lets users export custom Earth-observation embeddings from our OlmoEarth foundation models and use them for tasks like similarity search, few-shot mapping, change detection, and unsupervised exploration.

  • New feature in OlmoEarth Studio: export custom Earth-observation embeddings.
  • Embeddings are compact numerical representations from open-source OlmoEarth models.
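The OlmoEarth summary lists similarity search as one use of exported embeddings. The export format is not described here; assuming each exported embedding is a fixed-length vector keyed by a tile identifier (the names below are hypothetical), similarity search reduces to ranking tiles by cosine similarity to a query tile's embedding.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)


def most_similar(query: Sequence[float], index: Dict[str, Sequence[float]],
                 k: int = 3) -> List[Tuple[str, float]]:
    """Rank tile ids by cosine similarity to the query, highest first."""
    scored = [(tile_id, cosine(query, vec)) for tile_id, vec in index.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]
```

A brute-force scan like this is fine for small exports; larger collections would typically use an approximate nearest-neighbour index instead.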

A decade of real-time intelligence for the planet

For the past 10 years, Ai2 has built open, real-time tools that help people protect wildlife, oceans, and ecosystems around the world.

  • EarthRanger now covers over 900 protected areas in 95 countries, helping coordinate wildlife protection; in northern Thailand, for example, AI camera traps help reduce human-elephant conflict.
  • Skylight detects illegal fishing in real time using satellite imagery; Argentina has successfully carried out remote enforcement, setting a precedent for ocean governance.

Train separately, merge together: Modular post-training with mixture-of-experts

BAR is a recipe for post-training language models one capability at a time: train domain experts independently, merge them into a single mixture-of-experts model, and upgrade any expert without affecting the others.

  • BAR (Branch-Adapt-Route) enables modular post-training by training domain experts independently and merging them into a mixture-of-experts (MoE) model.
  • Progressive unfreezing of shared parameters is crucial: embeddings and LM head for SFT, attention for RL.
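The BAR summary says shared parameters are unfrozen progressively: embeddings and LM head for SFT, attention for RL. The sketch below writes that schedule as a small config, under two loud assumptions not confirmed by the summary: unfreezing is cumulative (RL keeps the SFT groups trainable), and each domain expert's own FFN weights are always trainable. The parameter-group names are hypothetical.

```python
from typing import Dict, List

# Hypothetical shared parameter-group names; real checkpoints use their own naming.
SHARED_GROUPS = ["embeddings", "attention", "lm_head"]

# Assumption: unfreezing is cumulative, so RL keeps the SFT groups and adds attention.
UNFREEZE_SCHEDULE: Dict[str, List[str]] = {
    "sft": ["embeddings", "lm_head"],
    "rl": ["embeddings", "lm_head", "attention"],
}


def trainable_groups(stage: str) -> List[str]:
    """Return the parameter groups trained in a stage; all others stay frozen.

    Assumption: the domain expert's own FFN ("expert_ffn") is always trainable.
    """
    if stage not in UNFREEZE_SCHEDULE:
        raise ValueError(f"unknown stage: {stage}")
    return ["expert_ffn"] + [g for g in SHARED_GROUPS if g in UNFREEZE_SCHEDULE[stage]]
```

Keeping the schedule as data rather than code makes it easy to compare alternative unfreezing orders when training a new expert.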

Evaluating agents for scientific discovery

Two benchmarks developed at Ai2, ScienceWorld and DiscoveryWorld, reveal that even very strong AI science agents struggle with problems human scientists solve routinely. ScienceWorld tests basic experiment execution, while DiscoveryWorld evaluates end-to-end scientific discovery. Current top models score ~80% on ScienceWorld but only ~20% on hard DiscoveryWorld tasks, compared with ~70% for human scientists.

  • ScienceWorld and DiscoveryWorld benchmark AI agents on basic lab skills and full scientific discovery processes.
  • Top models score ~80% on ScienceWorld, so they still do not fully solve a fourth-grade science curriculum.

Introducing WildDet3D: Open-world 3D detection from a single image

Ai2 releases WildDet3D, an open model for monocular 3D detection from a single RGB image. It supports text, point, and box prompts, generalizes across cameras and object categories, and incorporates depth signals when available. Ai2 also releases WildDet3D-Data, with over 1M images and 3.7M 3D annotations covering 13K categories. The model achieves 34.2 AP on Omni3D (text prompts) and performs strongly on multiple zero-shot benchmarks.

  • Supports multiple prompt modalities: text queries, point clicks, and 2D bounding boxes.
  • Achieves 34.2 AP on Omni3D with text prompts, a 5.8-point improvement over the prior best.
