Apple rebuilt its on-device AI stack at WWDC 2026
WWDC 2026 brought no new silicon, but a structural rebuild of how AI runs on Apple silicon: a new inference framework (Core AI), a new model format (.aimodel), a new generation of on-device models (AFM 3), and a changed posture toward the cloud including a partnership with Google and NVIDIA. The most surprising tell: Apple's flagship cloud model runs on NVIDIA GPUs in Google Cloud.
WWDC 2026 brought no new silicon. What it brought instead was a structural rebuild of how AI runs on Apple silicon:
a new inference framework,
a new model format,
a new generation of on-device models,
and a noticeably different posture toward the cloud.
None of it was the headline - the headline was the consumer features. But the developer documentation, the session code, and one machine-learning-research post add up to a clearer roadmap than the keynote did, plus a few details that are genuinely odd.
I read this layer closely - I'm building a profiler for it - so here is what stood out: the major changes, the subtle tells, and the findings I had to double-check before believing. One ground rule up front: everything below is from Apple's own documentation, WWDC session pages, and research posts, quoted where it matters. Where something is an individual developer's claim or a forum reading rather than Apple's word, I say so. Where Apple simply does not say, I say that too.
And the biggest caveat of all: I'm in Europe, so I spent the night watching, reading, and researching - I'm sure I got something wrong due to lack of sleep. :-)
The big change: Core AI replaces Core ML for neural networks
For a decade, Core ML was the answer to “run a model on an iPhone.” At WWDC 2026 Apple introduced Core AI, and the framing is a handover, not an addition. Core AI's documentation sends the old cases back to Core ML:
“If your app uses model types other than neural networks, such as decision trees or tabular feature engineering, see Core ML.” - Apple, Core AI documentation
And Core ML's documentation sends the new ones forward:
“If your app integrates AI models using the latest architectures and inference techniques, see Core AI.” - Apple, Core ML documentation
Read together, that is a split: Core ML narrows to classic, non-neural machine learning - decision trees, tabular features - while neural networks and transformers move to Core AI, which Apple describes as the engine behind the product itself:
“Core AI allows your app to use the latest model architectures and inference techniques across the CPU, GPU, and Neural Engine.” - Apple, Core AI documentation
The subtle tell is in the tooling. Apple's new Core AI debug gauge carries a one-line restriction:
“The gauge does not support the Core ML framework.” - Apple, Core AI debug gauge documentation
The new instrumentation simply does not look at the old framework. Core ML is not deprecated - its APIs are intact, and there is real backward-compatibility value in that - but the center of gravity, and the tooling investment, has moved.
A new artifact: the .aimodel bundle
Core AI ships a new on-disk format, .aimodel, and the first odd thing about it is that it is not a file. It is a directory. Apple's open coreai-models repository treats it as one throughout - the Python exporter deletes an old one with a directory-only call, and the Swift runtime resolves it as a “.aimodel directory.” Inside the surrounding model bundle is a plain-JSON metadata.json (schema version 0.2) that records the model kind (LLM, VLM, diffusion, segmenter), the tokenizer, vocabulary size, context length, the compression preset, and which file is the model. That JSON is documented and parseable. The weight payload itself - the part that would tell you the exact per-tensor bit-widths - is written by an opaque framework call, and its byte layout is not published anywhere I could find. So the format is half-open: a readable manifest wrapped around an undocumented blob.
Models are prepared with new Python tooling - Core AI Optimization (coreai-opt, the successor to coremltools) for compression, and Core AI PyTorch Extensions (coreai-torch) to export straight from PyTorch into the format - then optionally compiled ahead of time with xcrun coreai-build compile into per-architecture .aimodelc assets. The compression menu is wider than the GGUF world's: integer weights at 2, 4, and 8 bits; float micro-formats including FP8 (E4M3) and FP4 (E2M1); block-scaled MXFP8; and palettization from 1 to 8 bits. One forum reader (HN, opinion) noted Apple is also pushing activation quantization like w4a8 / w4a16; given Apple's install base, the formats it blesses could end up shaping how sub-100B models ship to everyone.
The hardware tell: the matmul moved into the GPU
No new chip, but WWDC 2026 made the M5 and A19 GPU story explicit, and it is the clearest hardware signal of the week. From Apple's M5/A19 tech talk:
“Neural accelerators are dedicated hardware in M5 purpose built for matrix multiplication. They're built into each shader core right alongside the other GPU pipelines such as ALU, raytracing... Each shader core has its own neural accelerator.” - Apple, “Accelerate your machine learning workloads with the M5 and A19 GPUs”
Apple's numbers: matrix multiplications up to 4 to 8 times faster, LLM time-to-first-token up to four times faster (prefill, which is compute-bound), token generation up to 25% faster (decode, which is memory-bound). And the framing underneath is one local-inference people will recognize, because it is the roofline - now stated in Apple's own Metal Performance Primitives guide:
“GEMMs with low arithmetic intensity are memory bound workloads, and GEMMs with high arithmetic intensity are compute bound workloads, forming the basis of a roofline model for kernel performance.” - Apple, Metal Performance Primitives Programming Guide
Compute-bound prefill, memory-bound decode: that prefill-versus-decode split is now Apple's own language, not just a community heuristic.
A second tell is hiding in code rather than slides. The coreai-models source infers a model's preferred compute unit from its graph structure: chunked, static-shape graphs prefer the Neural Engine; dynamic-shape graphs prefer the GPU. That quietly formalizes the bifurcation Apple has been hinting at for years - the Neural Engine for static, classic-shaped work, and the GPU (now with a Neural Accelerator inside each shader core) for the transformer matmul. Worth stressing: that is the model's preferred target encoded at export time, not a guarantee of where any given run actually executes.
The model: AFM 3, and the bandwidth wall
Apple also introduced its third generation of Foundation Models. On device: a 3-billion-parameter dense model (“AFM 3 Core”) and a 20-billion-parameter sparse mixture-of-experts (“AFM 3 Core Advanced,” natively multimodal, activating just 1 to 4 billion parameters at a time, and gated to the most capable Apple silicon).
The interesting part is the memory section, where Apple states the constraint plainly:
“the full model is stored in flash memory (NAND)” ... “NAND-to-DRAM bandwidth is too slow to swap weights token by token.” - Apple Machine Learning Research, AFM 3
That is Apple describing the exact wall every local-LLM runner runs into: a model too big to keep resident pays for it in bytes moved per token. Their answer is a mixture-of-experts with a high share of always-active “shared experts” alongside input-dependent “routed experts” - keep the always-on weights in memory, stream as little as possible - with quantization-aware training compressing the rest. It is a reminder that Apple is not exempt from the physics; it is just unusually candid about it in a research post.
The boundary: on-device, cloud, and the opaque middle
Apple's foundation models now span a spectrum, from on-device to the cloud, and the cloud end has a surprising shape. From the AFM 3 post:
“we worked with Google and NVIDIA to extend Private Cloud Compute to NVIDIA GPUs in Google Cloud.” - Apple Machine Learning Research, AFM 3
And from Apple's security team:
“collaborating with Google and NVIDIA to run new Apple Intelligence workloads on Google Cloud.” - Apple Security, Expanding Private Cloud Compute
Apple's most demanding model runs on NVIDIA GPUs, in Google Cloud, built with Google. For a company that designs its own silicon and markets on-device privacy, the flagship cloud model living on a competitor's hardware in a competitor's cloud is the most surprising tell of the week.
The part I most wanted to confirm is the switch. When does a request run on the device versus go to Private Cloud Compute, and can you tell after the fact which happened? So I went looking. Apple's APIs expose explicit choices - a Private Cloud Compute model option, a dedicated PrivateCloudComputeLanguageModel type you deliberately adopt. What I could not find - in the Core AI docs, the Foundation Models docs, or the Expanding-Private-Cloud-Compute security post - is any statement of when an on-device request transparently offloads, or whether that routing is visible to the developer or the user. So the honest version of this finding: the spectrum is real, the cloud is Google plus NVIDIA, and the triggering mechanism and its auditability are simply not publicly specified. Make of the silence what you will.
What developers can see: timing
Core AI ships three tools - a standalone Debugger app, an Xcode debug gauge, and an Instruments template - and it is worth being precise about what they measure, because they do measure something real. The Core AI instrument, per the docs:
“profiles execution timing across the CPU, GPU, and Neural Engine ... such as which compute units run your model ... The trace correlates Core AI events with hardware activity.” - Apple, Core AI documentation
Latency, token counts, and which compute unit ran the model - inside Xcode, for your own app's Core AI calls. Energy, memory bandwidth, and thermal state do not appear anywhere in the Core AI profiling documentation. That is a statement about what the tooling reports, not a judgment - but it is a notable gap given how much of on-device performance is decided by exactly those three.
The other track: MLX
Running in parallel, Apple kept investing in MLX as the bring-your-own-weights path for power users. WWDC 2026 added distributed inference across multiple Macs (a new JACCL backend over Thunderbolt 5), an OpenAI-compatible mlx_lm.server, and an agentic-on-Mac story built around it. Tellingly, the MLX sessions draw no line back to Core AI or Foundation Models - a deliberate two-track posture: the system's own models on Core AI and Foundation Models, the open community's models on MLX.
(One color detail, flagged as such: a developer's screenshot of a new fm command-line tool - shown, they say, in the Platforms State of the Union - exposes an OpenAI-compatible fm serve and the same on-device-versus-Private-Cloud-Compute model split. Treat it as one person's capture, not documentation.)
The broader implications
Pull back, and the roadmap is legible.
On-device AI is now a first-class platform capability. The same inference engine that powers Apple Intelligence is a developer framework, with its own format, toolchain, and profiler. That is a bigger commitment than a feature.
The stack is fragmenting before it consolidates. Core ML, Core AI, and MLX now coexist, and developers said so within hours - the thread under the Core AI announcement is full of people asking which of the three to use and why. Apple shipped the frameworks faster than the story that explains them.
The hard problems are the universal ones. AFM 3's NAND-bandwidth admission and the prefill-versus-decode roofline are the same constraints every local-inference project fights. The interesting thing is not that Apple solved them; it is that Apple now describes them in the same terms the rest of us do.
The cloud boundary is the part to watch. A local-to-cloud spectrum whose switch is undocumented, with the cloud end running on Google and NVIDIA, is a trust-and-architecture question that will draw more s
[truncated for AI cost control]