2026-06-30 08:23 UTC | In-site rewrite | 3 min read | Updated: 2026-06-30 08:28 UTC

Ollama's New MLX Engine Doubles Performance of Local LLMs on Mac

Ollama's new MLX engine brings massive performance improvements to local LLMs on Mac, with up to 2x faster inference and better output quality, especially beneficial for coding assistants and agent workflows.

Source: Hacker News AI | Author: taintech

I have been using Ollama to run local LLMs on my Mac, and it has been working just fine. However, my Mac's overall performance took a hit because local LLMs are resource-hungry. I have a MacBook Air M5 with 16GB of RAM. It's probably not the most powerful machine for this kind of workload, but it's been good enough to run models with fewer than 7 billion parameters.

That changed completely after I upgraded to Ollama's new MLX engine. I'm seeing massive performance improvements. Everything feels much more responsive, and inference is almost twice as fast now.

If you're already running local LLMs on a Mac through Ollama, this is one of the biggest upgrades since Apple Silicon became a serious inference platform. The latest MLX engine changes how models are represented, how memory is used, and how agent workflows are cached, which also has a massive impact on coding assistants like Claude Code, OpenClaw, Aider, and other multi-agent setups.


The MLX engine finally makes better use of Apple Silicon

It puts Apple Silicon to good use

Most local LLM users already know that Apple Silicon performs surprisingly well despite having relatively modest hardware. My MacBook Air M5 with 16GB of RAM handled smaller models without many issues, but the experience always came with trade-offs. Running a local model often slowed down everything else on the system.

Ollama's new MLX engine changes that by relying much more heavily on Apple's own MLX framework and unified memory architecture. Apple Silicon lets the CPU and GPU share a single memory pool instead of giving each its own separate memory. The updated engine makes much better use of that design, reducing unnecessary memory movement during inference.

The improvements go beyond better memory management. Ollama now combines several GPU operations into larger Metal kernels via MLX's just-in-time compiler, reducing inference overhead. The engine also improves GPU-backed sampling, allowing tokens to be generated much faster than before. Ollama claims the updated engine can deliver roughly 20% higher output speed than the previous Q4_K_M implementation, which matches what I noticed during daily use.
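If you want to check a speed claim like that on your own machine, here is a minimal sketch. It assumes the response from Ollama's /api/generate endpoint reports `eval_count` (tokens generated) and `eval_duration` (in nanoseconds), as the API documentation describes. The function names and the sample numbers are mine and purely illustrative, not measurements.

```python
def tokens_per_second(response: dict) -> float:
    """Decode throughput computed from an Ollama generate response."""
    eval_count = response["eval_count"]            # tokens generated
    eval_duration_ns = response["eval_duration"]   # time spent generating, in ns
    if eval_duration_ns <= 0:
        raise ValueError("eval_duration must be positive")
    return eval_count / (eval_duration_ns / 1e9)


def speedup_percent(old_tps: float, new_tps: float) -> float:
    """Relative change in throughput, expressed as a percentage."""
    return (new_tps - old_tps) / old_tps * 100.0


if __name__ == "__main__":
    # Hypothetical responses for the same prompt on the old and new engine.
    old_run = {"eval_count": 250, "eval_duration": 10_000_000_000}
    new_run = {"eval_count": 300, "eval_duration": 10_000_000_000}

    old_tps = tokens_per_second(old_run)
    new_tps = tokens_per_second(new_run)
    print(f"old: {old_tps:.1f} tok/s, new: {new_tps:.1f} tok/s")
    print(f"change: {speedup_percent(old_tps, new_tps):.0f}%")
```

Running the same prompt a few times and averaging gives a steadier number than a single request, since short prompts are noisy.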

My own workflow does not revolve around running large benchmark prompts. I usually spend my time asking programming questions, generating scripts, or testing automation ideas. Those workloads consist of many short requests throughout the day, and each one now feels more responsive.

Smaller models now produce better responses

Finally

Performance improvements usually receive the most attention, but I think the quality improvements matter just as much. Ollama's updated MLX engine now supports NVIDIA's model-optimized NVFP4 quantization format. Quantization reduces the memory required to run a model, but it also removes some information from the original weights. Lower memory usage usually comes at the cost of lower output quality.
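To make the memory-versus-quality trade-off concrete, here is a simplified sketch of block-wise 4-bit quantization. It is not Ollama's or NVIDIA's code. It assumes, as I understand the format, that NVFP4 stores each weight as a 4-bit float whose magnitudes are 0, 0.5, 1, 1.5, 2, 3, 4, and 6, with one shared scale per block of 16 weights. The real format keeps that scale in FP8 and adds a per-tensor scale; this sketch keeps the scale as an ordinary Python float, and all names are hypothetical.

```python
# Magnitudes representable by a 4-bit E2M1 float (assumed grid, sign stored separately).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 16  # weights that share one scale (assumption)


def quantize_block(block):
    """Map a block of weights to (scale, codes), where each code is a signed grid value."""
    max_abs = max(abs(w) for w in block)
    if max_abs == 0.0:
        return 1.0, [0.0] * len(block)
    scale = max_abs / E2M1_GRID[-1]  # largest weight lands on the largest grid value
    codes = []
    for w in block:
        target = abs(w) / scale
        nearest = min(E2M1_GRID, key=lambda g: abs(g - target))
        codes.append(nearest if w >= 0 else -nearest)
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate weights; the rounding error is the information lost."""
    return [scale * c for c in codes]


def bits_per_weight(value_bits=4, scale_bits=8, block_size=BLOCK_SIZE):
    """Average storage cost once the shared scale is spread over the block."""
    return value_bits + scale_bits / block_size


if __name__ == "__main__":
    weights = [0.6, -0.3, 0.0, 0.15, 0.42, -0.58, 0.07, 0.21,
               -0.11, 0.33, 0.5, -0.02, 0.27, -0.45, 0.09, 0.18]
    scale, codes = quantize_block(weights)
    restored = dequantize_block(scale, codes)
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"scale={scale:.4f}, worst-case error={worst:.4f}")
    print(f"bits per weight: {bits_per_weight()} (BF16 uses 16)")
```

The restored weights never match the originals exactly, which is the quality cost the paragraph above describes. A better format is one that keeps that error small at the same bit budget.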

NVFP4 reduces that compromise significantly. According to Ollama's own measurements with Gemma 4 12B, the new format reduces quality loss by roughly half compared to the widely used Q4_K_M format while maintaining similar memory requirements. The benchmark shows lower perplexity than Q4_K_M, which generally indicates that the model behaves much closer to its original BF16 version.
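For readers who have not worked with perplexity before, here is a small sketch of how it is computed and how a "quality loss cut by half" figure could be derived from it. Treating quality loss as the perplexity gap to the BF16 baseline is my assumption, not something Ollama's write-up confirms, and the numbers below are made up for illustration.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp of the negative mean log-probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def gap_reduction(ppl_reference, ppl_baseline_quant, ppl_new_quant):
    """Fraction of the baseline format's perplexity gap that the new format closes."""
    baseline_gap = ppl_baseline_quant - ppl_reference
    new_gap = ppl_new_quant - ppl_reference
    if baseline_gap <= 0:
        raise ValueError("baseline quantization shows no measurable gap")
    return 1.0 - new_gap / baseline_gap


if __name__ == "__main__":
    # A model that assigns probability 0.25 to every token has perplexity 4.
    print(perplexity([math.log(0.25)] * 8))

    # Hypothetical perplexities: BF16 reference, Q4_K_M, NVFP4.
    print(f"{gap_reduction(10.0, 10.4, 10.2):.0%} of the gap closed")
```

Lower perplexity means the model is less surprised by the evaluation text, so a quantized model whose perplexity sits closer to the BF16 figure is behaving more like the original.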

My Mac cannot comfortably run extremely large models, so I spend most of my time using smaller ones. Better quantization enables smaller models to produce stronger results without requiring additional hardware. That's a meaningful upgrade for anyone using a MacBook Air or another Apple Silicon system with limited memory.

I now notice that the generated code follows instructions more consistently, and follow-up prompts require fewer corrections than before. Responses also remain coherent over longer conversations, reducing the time spent rewriting prompts.

Coding agents benefit even more

Ollama redesigned agent workflows

The feature that surprised me the most has nothing to do with raw inference speed. Ollama also redesigned how its MLX engine handles cached model state during agent workflows. That's a big deal because coding assistants constantly resend huge amounts of context back to the model. Every tool call includes the system prompt, tool definitions, previous conversation history, and recently loaded files.

Traditional prefix caching only works while every request continues directly from the previous one. Modern coding agents rarely behave that way because they frequently branch into sub-agents, retry failed requests, or remove reasoning tokens from the visible conversation. Those changes normally force the model to process the same context repeatedly, even though most of it never changes.
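The limitation is easier to see with a toy example. The sketch below is my own illustration, not Ollama's code: it treats each piece of context as a single token and counts how much of a cached sequence a new request can reuse. The token names are hypothetical.

```python
def shared_prefix_len(cached, request):
    """Number of leading tokens the request has in common with the cached sequence."""
    count = 0
    for cached_tok, request_tok in zip(cached, request):
        if cached_tok != request_tok:
            break
        count += 1
    return count


def tokens_to_recompute(cached, request):
    """Everything after the first mismatch has to be processed again."""
    return len(request) - shared_prefix_len(cached, request)


if __name__ == "__main__":
    cached = ["system", "tools", "user1", "think1", "answer1", "file_a", "file_b"]

    # Case 1: the next request simply appends to the previous one.
    appended = cached + ["user2"]
    print(tokens_to_recompute(cached, appended))

    # Case 2: the agent strips the reasoning tokens before the next turn.
    stripped = ["system", "tools", "user1", "answer1", "file_a", "file_b", "user2"]
    print(tokens_to_recompute(cached, stripped))
```

In the second case, `file_a` and `file_b` are unchanged, yet they sit after the point where the sequences diverge, so a pure prefix cache has to process them again.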

Ollama addresses that problem with a new snapshot system. Instead of relying entirely on prefix caching, the engine stores reusable model states at important points during a conversation. Separate agent sessions can resume from those saved states instead of rebuilding everything from the beginning. Thinking models also benefit because snapshots preserve a useful state before reasoning tokens disappear from the conversation history.
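Here is a rough sketch of how a snapshot store along those lines might work. I have not read Ollama's implementation, so the class, its methods, and the idea of keying snapshots by the exact token prefix are all assumptions on my part. The "state" is a placeholder string where a real engine would keep the cached attention state.

```python
class SnapshotStore:
    """Keeps model states saved at chosen points, keyed by the tokens that produced them."""

    def __init__(self):
        self._snapshots = {}  # tuple of prefix tokens -> opaque saved state

    def save(self, tokens, state):
        """Record the state reached after processing exactly these tokens."""
        self._snapshots[tuple(tokens)] = state

    def longest_match(self, request):
        """Return (tokens_covered, state) for the longest saved prefix of the request."""
        for end in range(len(request), 0, -1):
            key = tuple(request[:end])
            if key in self._snapshots:
                return end, self._snapshots[key]
        return 0, None


if __name__ == "__main__":
    store = SnapshotStore()
    # Snapshots taken at stable points, before any reasoning tokens are added.
    store.save(["system", "tools"], "state-after-tools")
    store.save(["system", "tools", "user1"], "state-after-user1")

    # A sub-agent starts a fresh branch but shares the system prompt and tools.
    print(store.longest_match(["system", "tools", "subtask"]))

    # The main session comes back with its reasoning tokens stripped out.
    print(store.longest_match(["system", "tools", "user1", "answer1", "user2"]))
```

Unlike a single running prefix, several snapshots can coexist, so a branch or a rewritten history can still resume from the latest point it shares with something already computed.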

Ollama is a lot better now

The new update improves everything you use local LLMs for, whether it's chatting with a model or using it as a coding assistant. My own local workflows feel much quicker because repeated tool calls no longer spend as much time rebuilding context. Faster response times, combined with better output quality, make the new MLX engine one of the most worthwhile upgrades I have made to my local AI setup.

Ollama is a platform for downloading and running various open-source large language models (LLMs) on your local computer.
