Intelligence per Watt: A Unified Metric for the AI Era

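This piece proposes 'Intelligence Per Watt' (IPW), task accuracy per unit of power, as a metric for AI system efficiency, analogous to performance-per-watt in computing. Local models (≤20B parameters) accurately answer 88.7% of single-turn queries, and hybrid local-cloud inference can cut energy and costs by 60–80%. Weighted against GDP-relevant tasks, IPW also measures economic value; multiplied by accessible power, it yields 'Gross Domestic Intelligence' (GDI), a framework for national competitiveness.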

Source: Hacker News AI · Author: ilreb

01

Vision Statement

From 1946 to 2009, computing efficiency (performance per watt) doubled roughly every 1.5 years. This trend, documented by Koomey and colleagues, transformed where computing could happen. Workloads migrated from mainframe rooms to desktops, then laptops, then pockets. The transition from centralized time-sharing to personal computing didn't occur because PCs surpassed mainframes in raw performance. It occurred when efficiency gains made computing capable enough within the power constraints of personal devices.

We're at the same inflection point for artificial intelligence.

Today, most AI queries flow through centralized datacenters while demand grows steeply: token processing has increased 1300×, and year-over-year scaling is straining power grids. Yet telemetry shows that 77% of requests are practical tasks (writing emails, summarizing documents, seeking information) that don't require frontier-scale models.

We propose INTELLIGENCE PER WATT (IPW)—task accuracy per unit of power—as a unified metric for understanding this transition. Just as performance-per-watt guided the mainframe-to-PC shift, intelligence-per-watt clarifies the path from centralized AI to distributed intelligence. IPW provides a common framework for studying three questions shaping AI's future:

Workload Redistribution: From Cloud to Edge

Local language models (≤20B parameters) now accurately answer 88.7% of single-turn queries, and consumer accelerators run them at interactive latencies. IPW improved 5.3× from 2023 to 2025: 3.1× from model advances and 1.7× from hardware gains. By measuring intelligence efficiency across the model-hardware landscape, we can identify which queries belong on which devices. Hybrid systems that route queries appropriately cut energy, compute, and cost by 60–80% while preserving quality. IPW tracks this redistribution as it unfolds.
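As a rough illustration of the metric, the sketch below computes IPW as accuracy divided by average power draw, then checks that the two reported factors combine to approximately the overall gain. The helper name `intelligence_per_watt` and the sample accuracy and wattage are hypothetical; only the 3.1×, 1.7×, and 5.3× figures come from the text, and treating the two factors as multiplicative is an assumption (one that is consistent with 3.1 × 1.7 ≈ 5.3).

```python
def intelligence_per_watt(accuracy: float, avg_power_watts: float) -> float:
    """Task accuracy (0..1) per unit of average power draw (watts)."""
    if avg_power_watts <= 0:
        raise ValueError("avg_power_watts must be positive")
    return accuracy / avg_power_watts


# Hypothetical measurement for one model-accelerator pair (not from the source).
ipw_example = intelligence_per_watt(accuracy=0.80, avg_power_watts=40.0)

# Factors reported in the text for 2023 to 2025.
MODEL_GAIN = 3.1
HARDWARE_GAIN = 1.7

# Assumption: the model and hardware contributions compose multiplicatively.
combined_gain = MODEL_GAIN * HARDWARE_GAIN

print(f"example IPW: {ipw_example:.3f} accuracy per watt")
print(f"combined gain: {combined_gain:.2f}x")
```

Under that assumption the product is about 5.27, which rounds to the 5.3× overall improvement quoted above.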

Economic Value: Measuring AI's Real-World Impact

Not all intelligence is equal. A model that handles graduate-level physics but fails at email drafting delivers different economic value than one with the opposite profile. By weighting IPW against GDP-relevant task distributions, we can quantify how much economic value AI systems generate per watt consumed. This lens reveals where current systems create value, where gaps remain, and how efficiency gains translate into productivity across economic sectors.
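One way to read this weighting is as a weighted average of per-task accuracy, divided by power. The sketch below assumes that form, which the source does not spell out. The task names, GDP-share weights, accuracies, wattage, and the `gdp_weighted_ipw` helper are all hypothetical placeholders, not values from the authors' benchmark.

```python
def gdp_weighted_ipw(accuracy_by_task, gdp_weight_by_task, avg_power_watts):
    """Weighted-average accuracy (weights normalized to sum to 1) per watt."""
    if avg_power_watts <= 0:
        raise ValueError("avg_power_watts must be positive")
    total_weight = sum(gdp_weight_by_task.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_accuracy = sum(
        accuracy_by_task[task] * weight / total_weight
        for task, weight in gdp_weight_by_task.items()
    )
    return weighted_accuracy / avg_power_watts


# Hypothetical GDP-relevance weights and two models with opposite strengths.
weights = {"email_drafting": 0.7, "graduate_physics": 0.3}
physics_specialist = {"email_drafting": 0.5, "graduate_physics": 0.9}
email_specialist = {"email_drafting": 0.9, "graduate_physics": 0.5}

ipw_physics = gdp_weighted_ipw(physics_specialist, weights, avg_power_watts=50.0)
ipw_email = gdp_weighted_ipw(email_specialist, weights, avg_power_watts=50.0)

print(f"physics specialist: {ipw_physics:.4f}")
print(f"email specialist:   {ipw_email:.4f}")
```

With these made-up weights, the two models have the same unweighted mean accuracy and the same power draw, yet the email-focused model scores higher once tasks are weighted by their assumed economic relevance.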

National Competitiveness: The Global AI Race

The nation that most efficiently converts energy into deployed intelligence gains an advantage. We introduce Gross Domestic Intelligence (GDI), the product of intelligence-per-watt and accessible power, as a framework for AI competition. China and the United States face inverse constraints: China is compute-bound by export controls on advanced chips; America is energy-bound by grid limitations and datacenter bottlenecks. IPW reveals an asymmetric American asset: hundreds of millions of local accelerators already deployed in homes and offices. This installed base could boost effective AI capacity 2–4× without new datacenter construction.
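The GDI definition can be sketched as a simple product. The code below is illustrative only: summing GDI over a cloud tier and a local tier is an assumption about how the two would combine, and every number is a placeholder chosen to show the arithmetic rather than to reproduce the 2–4× estimate.

```python
def gross_domestic_intelligence(ipw: float, accessible_power_watts: float) -> float:
    """GDI as defined in the text: intelligence-per-watt times accessible power."""
    if accessible_power_watts < 0:
        raise ValueError("accessible_power_watts must be non-negative")
    return ipw * accessible_power_watts


# Hypothetical tiers as (IPW, accessible power in watts); not measured values.
cloud_tier = (0.02, 1_000_000.0)
local_tier = (0.01, 1_000_000.0)

gdi_cloud_only = gross_domestic_intelligence(*cloud_tier)

# Assumption: tiers contribute additively to a national total.
gdi_hybrid = gdi_cloud_only + gross_domestic_intelligence(*local_tier)

capacity_ratio = gdi_hybrid / gdi_cloud_only
print(f"hybrid / cloud-only capacity: {capacity_ratio:.2f}x")
```

The printed ratio depends entirely on the placeholder inputs. The point is only that adding already-deployed local power raises the total without changing the cloud term.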

The path forward: Intelligence per watt should be a north star metric for model architecture, hardware design, and national strategy. We're building the measurement infrastructure, benchmarks, and systems to make this concrete—and releasing our tools for others to use.

02

The IPW Research Agenda

We're pursuing a coordinated research program to understand and maximize intelligence efficiency across the full stack.

Category | Initiative | Objective
Measurement & Benchmarking | GDP-Weighted Evaluation | Quantifying economic value generated per watt on real-world, GDP-relevant tasks.
Measurement & Benchmarking | IPW Attribution | Decomposing efficiency gains into algorithmic versus hardware contributions through continuous benchmarking.
National Competitiveness | Gross Domestic Intelligence | Identifying high-impact interventions across inference systems, power grids, and model architectures.
Models & Systems | Post-training for IPW | Training local models to use frontier models as tools for verification and sophisticated assistance.
Models & Systems | Hybrid Inference Engine | Building systems that automatically route work between local and cloud compute to maximize IPW subject to latency, privacy, and cost constraints.
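
The hybrid inference objective in the last row (maximize IPW subject to latency, privacy, and cost constraints) can be sketched as a filter-then-select rule. This is a minimal illustration, not the authors' engine: the `Backend` fields, the `route` function, the sample backends, the reading of "privacy" as "must stay on a local device", and the optional `min_accuracy` floor are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Backend:
    name: str
    is_local: bool
    est_accuracy: float     # expected accuracy on this query type, 0..1
    avg_power_watts: float
    est_latency_s: float
    est_cost_usd: float

    @property
    def ipw(self) -> float:
        return self.est_accuracy / self.avg_power_watts


def route(backends: Sequence[Backend], *, max_latency_s: float,
          max_cost_usd: float, requires_privacy: bool,
          min_accuracy: float = 0.0) -> Optional[Backend]:
    """Return the feasible backend with the highest IPW, or None if none fits."""
    feasible = [
        b for b in backends
        if b.est_latency_s <= max_latency_s
        and b.est_cost_usd <= max_cost_usd
        and b.est_accuracy >= min_accuracy
        and (b.is_local or not requires_privacy)  # assumption: private => local only
    ]
    return max(feasible, key=lambda b: b.ipw, default=None)


# Hypothetical backends with made-up estimates.
backends = [
    Backend("local-small", True, 0.80, 40.0, 1.5, 0.0),
    Backend("cloud-frontier", False, 0.95, 700.0, 2.5, 0.01),
]

easy = route(backends, max_latency_s=3.0, max_cost_usd=0.05, requires_privacy=False)
hard = route(backends, max_latency_s=3.0, max_cost_usd=0.05, requires_privacy=False,
             min_accuracy=0.9)
private_hard = route(backends, max_latency_s=3.0, max_cost_usd=0.05,
                     requires_privacy=True, min_accuracy=0.9)
```

In this toy setup the easy query goes to the local backend, the query with a 0.9 accuracy floor goes to the cloud backend, and the private query with the same floor has no feasible backend, so `route` returns `None`. A real router would need per-query accuracy estimates, which is the hard part this sketch takes as given.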

03

Papers + Code

Publications

📰 Article

China's AI Heist

A Foreign Affairs essay on how the United States should respond to Beijing's unauthorized "distillation" of frontier AI models, and what safeguarding America's lead in AI will require.

Read in Foreign Affairs →

📄 Publication

Intelligence Per Watt: Measuring Intelligence Efficiency of Local AI

Introduces "intelligence per watt" (IPW) as a metric for measuring AI efficiency, finding that local LMs can answer 88.7% of single-turn reasoning & chat queries and that hybrid local-cloud routing cuts energy use by 64% and costs by 59% compared to cloud-only inference.

Paper (arXiv) → Blog Post →

📄 Publication

Maximizing American Gross Domestic Intelligence with Hybrid Inference

Proposes "Gross Domestic Intelligence" (GDI) as a framework for national AI competitiveness, arguing that the U.S. can boost effective inference capacity 2–4× by activating the 70–80M AI-capable devices already deployed in American homes and offices alongside cloud infrastructure.

Blog Post →

📄 Publication

OpenJarvis: Personal AI, On Personal Devices

An open-source framework for building personal AI agents that run entirely on-device, providing composable primitives for local AI systems that prioritize efficiency and privacy by keeping user data on personal hardware rather than routing through cloud services.

Paper (arXiv) → Blog Post →

📄 Publication

Minions: Cost-efficient Collaboration Between On-device and Cloud Language Models

Introduces protocols for local-cloud LM collaboration on long-document reasoning tasks, where MinionS reduces cloud costs by 5.7× while maintaining 97.9% of frontier model accuracy by decomposing tasks into parallelizable subtasks executed locally.

Paper (arXiv) → Blog Post →

📄 Publication

Archon: An Architecture Search Framework for Inference-Time Techniques

An automated framework for optimizing inference-time techniques in LLMs, exploring a large design space to discover optimized configurations. Archon-designed systems outperform frontier models such as OpenAI's o1, GPT-4o, and Claude 3.5 Sonnet by an average of 15.1% across instruction-following, reasoning, and coding tasks.

Paper (arXiv) →

📄 Publication

Weaver: Shrinking the Generation-Verification Gap with Weak Verifiers

A framework combining multiple imperfect verifiers to evaluate language model responses. Uses weighted ensembles of weaker verification systems with weak supervision to estimate accuracy, achieving competitive results with smaller models that approach the performance of advanced systems like o3-mini.

Paper (arXiv) →

Code + Tools

🔧 Code & Tools

IPW Profiling Harness

Open-source benchmarking suite that profiles LLM inference across NVIDIA, AMD, and Apple Silicon, measuring energy consumption, power draw, latency, and throughput to compute intelligence-per-watt metrics for any model-accelerator configuration.

GitHub Repository →

🔧 Code & Tools

OpenJarvis

Open-source toolkit for building and deploying personal AI agents on local hardware. Provides composable primitives, device-optimized model serving, and privacy-preserving pipelines for on-device intelligence.

GitHub Repository → Docs →

🔧 Code & Tools

Minions

Reference implementation for local-cloud LM collaboration protocols. Includes MinionS and Minion strategies for decomposing tasks across on-device and cloud models to reduce costs while preserving accuracy.

GitHub Repository →

🔧 Code & Tools

Archon

Architecture search framework for automatically discovering optimized inference-time technique configurations across LLMs, including generation ensembling, fusion, ranking, and verification strategies.

GitHub Repository →

🔧 Code & Tools

Weaver

Toolkit for building weighted ensembles of weak verifiers to evaluate language model outputs. Enables scalable verification using smaller, cost-efficient models with weak supervision techniques.

GitHub Repository →

Related Works

A collection of resources that inform and connect to Intelligence Per Watt research.

Algorithmic Progress in Language Models

How Fast is Algorithmic Progress in AI Inference?

LLM Inference Price Trends (Epoch AI)

Compute Equivalent Gain (CEG) Accounting

Inference Efficiency Analysis

Training Compute-Optimal Models

Green Grid Metrics

Zeus: ML Energy Measurement

AI Energy Score (Hugging Face)

Energy Considerations for LLM Inference

MLCommons Inference Benchmark

MLCommons Inference Policies

LLM Energy Measurement

ML Energy Measurement Tutorial

The Simple Macroeconomics of AI (Acemoglu)

Thoughts on AI and Economics (Boaz Barak)

How AI is Transforming Work at Anthropic

Remote Labor AI

LLM Labor Market Demand Analysis

GDPVal Dataset

Snorkel AI Leaderboard

IBM Enterprise Ops Benchmark

APEX Benchmark

INFaaS: Automated Model-less Inference

MIT Iceberg

Cisco Unified Edge Computing

LLM Router

Efficient Inference Routing

06

People

Principal Investigators

Christopher Ré

Principal Investigator

Azalia Mirhoseini

Principal Investigator

John Hennessy

Principal Investigator

PhD Students

Jon Saad-Falcon

PhD Student

Avanika Narayan

PhD Student

Master's Students

Herumb Shandilya

Master's Student

Matthew Hart

Master's Student

Undergraduates

Hakki Orhun Akengin

Undergraduate

Tanvir Bhathal

Undergraduate

Gabriel Bo

Undergraduate

Adrian Gamarra Lafuente

Undergraduate

J. Wes Griffin

Undergraduate

Robby Manihani

Undergraduate

Andrew Park

Undergraduate

Industry Collaborators

Jared Dunnmon

Industry Collaborator

Chuan Li

Lambda Labs

Caia Costello

Lambda Labs

04

Blog

May 15, 2026 · Hazy Research ↗

From Minions to OpenJarvis: A Retrospective on Two Years in Local AI

A look back at two years of research on local AI — tracing the path from Minions through to OpenJarvis, the lessons learned along the way, and where on-device intelligence is headed next.

March 17, 2026

How Close Are Local Models to the Cloud? An OpenJarvis Benchmark Study

We used OpenJarvis to run a head-to-head evaluation of 8 local open-source models against 6 frontier cloud models across 5 representative use-case benchmarks. The headline: local models rank within the top 3 overall.

March 12, 2026 · Scaling Intelligence Lab ↗

OpenJarvis: Personal AI, On Personal Devices

An open-source framework for personal AI agents that run entirely on-device. OpenJarvis provides composable primitives, treats efficiency as a first-class constraint, and lets models improve locally from interaction traces while keeping user data on personal hardware.

November 11, 2025 · Hazy Research ↗

Intelligence Per Watt: A Study of Local Intelligence Efficiency

Introduces Intelligence Per Watt (IPW) as a metric for how effectively inference systems convert energy into accurate computation. Local LMs handle 88.7% of single-turn chat queries while IPW improved 5.3× over two years — pointing toward a shift from centralized cloud to distributed edge inference.

How Close Are Local Models to the Cloud? An OpenJarvis Benchmark Study

Avanika Narayan, Jon Saad-Falcon · March 17, 2026

TL;DR — We used OpenJarvis to run a head-to-head evaluation of 8 local open-source models against 6 frontier cloud models across 5 representative use-case benchmarks. The headline: local models rank within the top 3 overall, with the best local model (Qwen3.5:122B-A10B, 0.840 avg accuracy) matching or exceeding frontier cloud models like Claude Opus 4.6 and GPT-5.4. When you factor in that local inference costs $0 in API fees (you already own the hardware), the picture starts to get very interesting.

The Eval Setup

Tasks — We designed 5 use-case benchmarks that mirror
