
Two Years of Local AI on a Laptop: When Open Models Outpaced Moore's Law

From May 2024 to May 2026, the most expensive MacBook Pro stayed at 128 GB of unified memory, yet the smartest open-weight model running on it jumped from a score of 10 to 47 on the Artificial Analysis Intelligence Index—a 4.7× improvement, doubling every 10.7 months, more than twice the pace of Moore's Law. Gains came from sparse mixture-of-experts, aggressive quantization, and reasoning-tuned small dense models.


Community Article Published May 11, 2026


Mishig Davaadorj

mishig

TL;DR

Between May 2024 and May 2026, the most expensive MacBook Pro you could buy stayed at 128 GB of unified memory. The hardware ceiling barely moved. But the smartest open-weight model you could actually run on it went from a score of 10 (Llama 3 70B) to 47 (DeepSeek V4 Flash on antirez's mixed-Q2 GGUF) on the Artificial Analysis Intelligence Index.

That is 4.7× in 24 months, or a doubling of intelligence every 10.7 months.

Moore's Law says transistor counts double roughly every 24 months. Local open-weight AI on a laptop has been improving more than twice as fast, on hardware whose memory ceiling never moved.

Smartest open-weight model on a 128 GB MacBook Pro
Artificial Analysis Intelligence Index v4.0 (higher score better)

Date      Model              Score
────────  ─────────────────  ───────────────────────────────────────────────
May 2024  Llama 3 70B        ██████████ 10
Oct 2024  Qwen 2.5 72B       ████████████████ 16
Mar 2025  Llama 3.3 70B      ██████████████ 14
Oct 2025  gpt-oss-120B       █████████████████████████████████ 33
May 2026  Gemma 4 31B        ███████████████████████████████████████ 39
May 2026  Qwen3.6 27B        ██████████████████████████████████████████████ 46
May 2026  DeepSeek V4 Flash  ███████████████████████████████████████████████ 47

For reference, Moore's Law would predict a score of ≈ 20 in May 2026 (starting at 10, doubling every 24 months).

The hardware stood still

The premise of this post is simple. Buy the most expensive MacBook Pro on the market. What is the smartest open-weight model you can actually run on it, measured by a fixed benchmark? Repeat roughly every six months for two years.

Chip (release)     In market             Max unified memory  Memory bandwidth
─────────────────  ────────────────────  ──────────────────  ────────────────
M3 Max (Nov 2023)  May 2024 to Oct 2024  128 GB              400 GB/s
M4 Max (Oct 2024)  Nov 2024 to Mar 2026  128 GB              546 GB/s
M5 Max (Mar 2026)  Mar 2026 to today     128 GB              614 GB/s

Three generations of flagship Max chips. RAM ceiling never moved. Memory bandwidth grew about 50 percent, which matters for decode speed, but does not change which models can fit in memory.

What changed was the models.

The five snapshots

For each timepoint I picked the smartest open-weight model that:

Was released by that date.

Fits in 128 GB at a usable quantization. Q4 is the default, but mixed Q2 schemes (IQ2_XXS for routed experts plus Q8 on attention, shared experts, and output) also qualify.

Runs at 5 tokens per second or faster on the then-current top MacBook Pro.
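The three criteria above amount to a filter followed by a maximum. Here is a minimal sketch, assuming each candidate is described by a release date, a quantized size, a measured decode speed, and a benchmark score. The `Candidate` type, the `pick_smartest` helper, and the placeholder entries are all hypothetical and are not measurements from this post. The "fits" check is simplified to a size comparison and ignores KV cache and OS overhead.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Candidate:
    name: str
    released: date        # public release date
    size_gb: float        # size at the chosen quantization
    tokens_per_s: float   # measured decode speed on the target laptop
    score: float          # benchmark score


def pick_smartest(candidates, snapshot, ram_gb=128.0, min_tokens_per_s=5.0):
    """Return the highest-scoring candidate that meets all three criteria, or None."""
    eligible = [
        c for c in candidates
        if c.released <= snapshot                 # released by that date
        and c.size_gb < ram_gb                    # fits in memory (simplified)
        and c.tokens_per_s >= min_tokens_per_s    # fast enough to be usable
    ]
    return max(eligible, key=lambda c: c.score, default=None)


# Placeholder entries for illustration only.
candidates = [
    Candidate("model-a", date(2024, 4, 1), 40.0, 9.0, 10),
    Candidate("model-b", date(2024, 4, 1), 200.0, 9.0, 30),  # too large
    Candidate("model-c", date(2024, 4, 1), 60.0, 2.0, 25),   # too slow
    Candidate("model-d", date(2024, 9, 1), 45.0, 8.0, 16),   # released later
]

best = pick_smartest(candidates, snapshot=date(2024, 5, 31))
print(best.name if best else None)
```

With these placeholders, the May snapshot selects `model-a`, because the higher-scoring entries fail the size, speed, or release-date check.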

All scores are Artificial Analysis Intelligence Index v4.0 against the full-precision hosted model. Artificial Analysis has rebased the index twice in this window, so older press release numbers are not directly comparable.

Date      Top open-weight model   Quant             Score
────────  ──────────────────────  ────────────────  ─────
May 2024  Llama 3 70B Instruct    Q4                10
Oct 2024  Qwen 2.5 72B Instruct   Q4                16
Mar 2025  Llama 3.3 70B Instruct  Q4                14
Oct 2025  gpt-oss-120B (high)     MXFP4 native      33
May 2026  DeepSeek V4 Flash       IQ2_XXS + Q8 mix  47

The progression 10, 16, 14, 33, 47 is not linear. There are two discontinuities.

Discontinuity 1: sparse MoE arrives (August 2025)

For more than a year, the local ceiling was 70 billion dense parameters. Llama 3 70B, then Qwen 2.5 72B, then Llama 3.3 70B. The Mac memory bandwidth wall was the bottleneck: a 70B dense model at Q4 reads about 40 GB per token, capping decode at 8 to 12 tokens per second on M4 Max.

gpt-oss-120B broke this. 117 billion total parameters, but only 5.1 billion active per token. The MoE router selects a different subset of experts for each token, so decode is bandwidth-bound on only the active path. Result: 40 to 60 tokens per second on M4 Max, while the Artificial Analysis Intelligence Index score jumped from 14 to 33.
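The bandwidth argument in the two paragraphs above can be sketched as a simple ceiling: if each decoded token has to stream its active weights from memory once, decode speed cannot exceed bandwidth divided by bytes read per token. The helper names are hypothetical. The 4.25 bits per weight for MXFP4 is an assumption (4-bit values plus one shared 8-bit scale per block of 32), and applying it to every active weight is a further simplification.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB for a parameter count at a given bit width."""
    return params_billion * bits_per_weight / 8.0


def decode_ceiling_tokens_per_s(bandwidth_gb_s: float, gb_read_per_token: float) -> float:
    """Upper bound on decode speed when each token streams its weights once."""
    return bandwidth_gb_s / gb_read_per_token


M4_MAX_GB_S = 546.0  # memory bandwidth from the hardware table

# Dense 70B at Q4: the post quotes about 40 GB read per token.
dense = decode_ceiling_tokens_per_s(M4_MAX_GB_S, 40.0)

# gpt-oss-120B: 5.1B active parameters at an assumed ~4.25 bits per weight.
moe = decode_ceiling_tokens_per_s(M4_MAX_GB_S, weights_gb(5.1, 4.25))

print(f"dense 70B ceiling:    {dense:.1f} tok/s")
print(f"gpt-oss-120B ceiling: {moe:.0f} tok/s")
```

The dense ceiling comes out near 14 tokens per second, just above the observed 8 to 12. The MoE ceiling comes out near 200, well above the observed 40 to 60, so this bound is loose: it ignores attention, KV cache reads, and runtime overhead. It still shows why cutting the active path raises the ceiling by an order of magnitude.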

The model also shipped natively in MXFP4, which means there is essentially zero quantization quality loss on the local copy. The hosted benchmark and your laptop run the same weights.

Discontinuity 2: small dense reasoning catches up, huge MoE fits via Q2 (April 2026)

Two things happened within two weeks of each other.

Qwen3.6 27B (Reasoning) arrived on April 22, 2026. A dense 27 billion parameter model that scores 46 on the Artificial Analysis Intelligence Index. At Q4 it occupies 15 GB. On a 128 GB MacBook Pro, that leaves 113 GB of headroom for context, KV cache, or other apps.

DeepSeek V4 Flash arrived on April 24, 2026. 284 billion total parameters, 13 billion active. At full precision it does not fit on a laptop. But antirez published a GGUF using IQ2_XXS for routed experts (the bulk of the weights) and Q8 for attention, shared experts, and output. Total: 80.8 GB. Artificial Analysis Intelligence Index at full precision: 47.

Either of these would have taken the laptop ceiling above gpt-oss-120B. DeepSeek V4 Flash takes the headline by one point, but Qwen3.6 27B is the cleaner story: a 27B dense model that nearly matches a 284B mixture-of-experts on the same benchmark.
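The sizes quoted above follow from parameter count times average bits per weight. Here is a rough sketch, assuming the nominal llama.cpp widths of 2.0625 bits per weight for IQ2_XXS and 8.5 for Q8_0, and about 4.5 bits per weight for a Q4-class quant. The helper names are hypothetical, and the post does not state the split between routed experts and everything else, so the second function only back-solves what split the published 80.8 GB would imply. File metadata and GB versus GiB are ignored.

```python
def mixed_quant_size_gb(total_params_billion, low_fraction, low_bits, high_bits):
    """Size in GB when low_fraction of the weights use low_bits per weight
    and the remainder use high_bits per weight."""
    avg_bits = low_fraction * low_bits + (1.0 - low_fraction) * high_bits
    return total_params_billion * avg_bits / 8.0


def implied_low_fraction(total_params_billion, size_gb, low_bits, high_bits):
    """Solve the same formula for the share of weights stored at low_bits."""
    avg_bits = size_gb * 8.0 / total_params_billion
    return (high_bits - avg_bits) / (high_bits - low_bits)


IQ2_XXS_BITS = 2.0625  # assumed nominal bits per weight
Q8_0_BITS = 8.5        # assumed nominal bits per weight

# Figures from the post: 284B total parameters, 80.8 GB total.
frac = implied_low_fraction(284, 80.8, IQ2_XXS_BITS, Q8_0_BITS)
print(f"implied share of weights at IQ2_XXS: {frac:.1%}")

# Dense 27B at an assumed ~4.5 bits per weight.
print(f"27B at ~4.5 bpw: {mixed_quant_size_gb(27, 1.0, 4.5, 4.5):.1f} GB")
```

Under these assumptions, roughly 97 percent of the weights would sit in the low-bit routed experts, which is consistent with the description of them as the bulk of the model. The 27B dense estimate lands close to the 15 GB quoted for Qwen3.6 27B.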

Open-weight models that fit on a 128 GB MacBook Pro, May 2026
Sorted by Artificial Analysis Intelligence Index v4.0 (higher score better)

Model                  Quant   Size     Score
─────────────────────  ──────  ───────  ───────────────────────────────────────────────
DeepSeek V4 Flash      Q2-mix  80.8 GB  ███████████████████████████████████████████████ 47
Qwen3.6 27B Reasoning  Q4      15 GB    ██████████████████████████████████████████████ 46
Qwen3.6 35B A3B        Q4      19 GB    ███████████████████████████████████████████ 43
Gemma 4 31B            Q4      17 GB    ███████████████████████████████████████ 39
gpt-oss-120B (high)    MXFP4   63 GB    █████████████████████████████████ 33
GLM-4.6                Q2-mix  ~110 GB  █████████████████████████████████ 33
Gemma 4 26B A4B        Q4      14 GB    ███████████████████████████████ 31
GLM-4.5-Air            Q4      57 GB    ███████████████████████ 23

Models on Hugging Face: Qwen3.6 35B A3B, Gemma 4 31B, Gemma 4 26B A4B, GLM-4.6, GLM-4.5-Air.

Compared to Moore's Law

Moore's Law as originally stated covered transistor count: doubling every 24 months. Loosely interpreted as "capability doubles every two years", it gives a reference rate for technological progress.

Local AI on a MacBook Pro went from an Artificial Analysis Intelligence Index score of 10 to 47 in 24 months. That is 2.23 doublings, or a doubling every 10.7 months. More than twice the pace of Moore's Law.

If local intelligence had followed Moore's Law strictly, May 2026 would look like a score of 20, between Qwen 2.5 72B (16) and GLM-4.5-Air (23). Instead it looks like DeepSeek V4 Flash at 47.
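The arithmetic behind these figures is a base-2 logarithm. A short sketch that recomputes the number of doublings, the doubling time, and the Moore's Law reference score from the values quoted in this section; the function names are hypothetical.

```python
import math


def doublings(start: float, end: float) -> float:
    """How many times start must double to reach end."""
    return math.log2(end / start)


def doubling_time_months(start: float, end: float, elapsed_months: float) -> float:
    """Average months per doubling over the elapsed period."""
    return elapsed_months / doublings(start, end)


def projected(start: float, elapsed_months: float, doubling_months: float) -> float:
    """Value after elapsed_months of steady doubling every doubling_months."""
    return start * 2.0 ** (elapsed_months / doubling_months)


n = doublings(10, 47)
t = doubling_time_months(10, 47, 24)
moore = projected(10, 24, 24)

print(f"{n:.2f} doublings, one every {t:.1f} months")
print(f"Moore's Law pace from 10 over 24 months: {moore:.0f}")
```

This reproduces the 2.23 doublings, the 10.7-month doubling time, and the reference score of 20.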

Even more strikingly: Moore's Law was about hardware getting faster. In this story, the hardware barely changed. All the gains came from software and model design.

Why it happened

Three ingredients did most of the work.

Sparse Mixture of Experts. MoE decouples model capacity from per-token compute. A 284 billion parameter model with 13 billion active per token reads roughly the same memory per decoded token as a 13 billion dense model, but holds far more knowledge in its weights. This is what made gpt-oss-120B and DeepSeek V4 Flash possible on consumer hardware.

Aggressive quantization as a normal practice. Q4 GGUF and MLX 4-bit became table stakes by mid-2024. The next step was mixed-precision schemes: IQ2_XXS on bulk routed experts combined with Q8 on attention and shared experts. This preserves quality much better than uniform low-bit quantization. The community now ships these by default, not as exotic experiments.

Reasoning-tuned small dense models. Qwen3.6 27B (Reasoning) at an Artificial Analysis Intelligence Index score of 46 is a dense 27 billion parameter model that comes within one point of a 284 billion parameter MoE. Better training data, better reinforcement learning recipes, and explicit chain-of-thought training pushed capability per parameter up sharply through 2025 and 2026.
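To make the first ingredient concrete, here is a toy sketch of top-k expert routing, and of how little of the model is touched per token as a result. It shows one common variant (select the top k router logits, then softmax over only those) and is not the routing code of any model named above. All names and numbers are illustrative assumptions.

```python
import math


def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    peak = max(router_logits[i] for i in top)  # subtract max for numerical stability
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def active_fraction(num_experts, k, expert_params, shared_params):
    """Share of all parameters read when k of num_experts experts run for a token."""
    total = shared_params + num_experts * expert_params
    active = shared_params + k * expert_params
    return active / total


# Toy router scores for one token over 8 experts.
routes = top_k_route([0.1, 2.0, -1.0, 1.5, 0.3, 0.0, -0.5, 0.9], k=2)
print(routes)

# Toy parameter counts in arbitrary units.
print(f"{active_fraction(num_experts=8, k=2, expert_params=10, shared_params=20):.0%}")
```

A different token produces different router scores and so a different pair of experts. Total capacity grows with the number of experts, while the per-token read grows only with k.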

What the next year might bring

Extrapolating at one doubling per 10.7 months gives an index score of roughly 100 by May 2027 on the same 128 GB laptop hardware (47 × 2^(12/10.7) ≈ 102). That assumes the architectural innovations keep arriving (no guarantee) and that the Artificial Analysis Intelligence Index does not get rebased again (possible, even likely).

The harder constraint going forward is the 128 GB ceiling. If Apple raises max unified memory in M6 Max, the curve has more room to run. If the ceiling stays put, future gains will come entirely from models getting smaller and smarter.

Caveats

The Artificial Analysis Intelligence Index was rebased twice in this window (v2 to v3 in early 2025, v3 to v4.0 in late 2025). Every score in this post is reconciled to v4.0. Press release numbers from 2024 quoting much higher absolute scores were measured on an older index.

The Artificial Analysis Intelligence Index is run against full-precision hosted endpoints. A local Q4 quantization of a dense model typically scores 1 to 3 index points lower. The Q2-mixed quantization on DeepSeek V4 Flash takes a slightly larger hit, cushioned by keeping the sensitive layers at Q8.

"Fits in 128 GB at usable quant" is shorthand. gpt-oss ships natively in MXFP4. DeepSeek V4 Flash uses IQ2_XXS plus Q8 in the published community GGUF. The shape of "fits" depends on quant tooling more than on a single bit-width number.

Context length eats memory. The 5 tokens per second floor is met comfortably at moderate context. Past about 10K tokens, decode speed falls 30 to 50 percent on M4 and M5 Max, and the KV cache eats RAM headroom fast. The larger MoE models become marginal past 64K tokens.
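The last caveat can be put in numbers. Here is a minimal sketch of KV cache size for a standard grouped-query attention model, using Llama 3 70B's published configuration (80 layers, 8 KV heads, head dimension 128) and an assumed 16-bit cache. The helper name is hypothetical. Models with sliding-window or compressed attention caches will use less, so treat this as an estimate for the plain case only.

```python
def kv_cache_gb(layers, kv_heads, head_dim, context_tokens, bytes_per_value=2):
    """KV cache size in GB: one key and one value vector
    per layer, per KV head, per token."""
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value
    return per_token_bytes * context_tokens / 1e9


# Llama 3 70B configuration with an assumed 16-bit (2-byte) cache.
for ctx in (10_000, 64_000):
    print(f"{ctx:>6} tokens: {kv_cache_gb(80, 8, 128, ctx):.1f} GB")
```

Under these assumptions, the cache takes about 3.3 GB at 10K tokens and about 21 GB at 64K tokens, on top of the weights.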

Reproduce this yourself

Every model in this post is on Hugging Face. The Artificial Analysis numbers are at artificialanalysis.ai/models. The MacBook Pro specs are on apple.com. The antirez DeepSeek V4 Flash GGUF lives at huggingface.co/antirez/deepseek-v4-gguf.

Pull them down. Run them locally. The numbers in this post will be a year out of date soon.
