2026-05-10 · Site rewrite

On a quest to become AI independent

The author explores AI independence after GitHub Copilot's shift to usage-based billing. By analyzing the economics of AI, he decides to invest in local inference hardware to reduce reliance on big AI providers. The article details hardware options like Mac M3 Ultra, 8× Nvidia RTX 3090, and Ryzen AI Max+, and explains the memory bandwidth bottleneck in inference.

Article intelligence

Engineers · Advanced

Key points

GitHub Copilot's usage-based billing reveals AI companies' strategy to build dependency through low prices.
The author argues the AI bubble is a trap, suggesting local inference to reduce dependence.
Memory bandwidth is the key bottleneck for inference performance, not raw compute.
Compares hardware options: Mac M3 Ultra, 8× RTX 3090, and Ryzen AI Max+.

Why it matters

This matters because GitHub Copilot's usage-based billing reveals AI companies' strategy to build dependency through low prices.

Technical impact

May affect model selection, inference cost, product capability, and evaluation benchmarks.

May 10, 2026

A few weeks ago, GitHub announced that Copilot is moving to usage-based billing. No more flat subscriptions: from now on, everyone has to pay for the tokens they use.

If you’ve been using Copilot on the free tier or an individual plan (as was my case, through a benefit for active open-source contributors), this probably stings. This subscription was the perfect way to test every new model without having to commit to specific subscriptions, and it came with an extremely generous monthly quota. I know of many people who bought GitHub Copilot subscriptions over Anthropic ones because Copilot gave you access to Sonnet and Opus with higher quotas than those provided in Claude. So the obvious question is: why was it so cheap?

The answer is definitely not generosity. It is well known that AI labs and big tech have been subsidising token costs for the same reason any platform subsidises onboarding: to build dependency and crush their competition before they extract value. Every cheap API call is also a training data point. Every workflow you wrap around their service is a switching cost they’re accumulating on your behalf. GitHub Copilot at $10/month was never a sustainable product, as is probably also the case for more popular products like Claude Code and Codex. It was a land grab dressed up as a subscription. The cost per user of all these AI subscriptions (at least from the well-funded companies that can afford it) significantly exceeds the price of the subscriptions.

My most loyal readers know that I’ve been concerned about the economics of AI for a while. In this post I already made my argument that “the AI Bubble is more a trap than a bubble”, and that by accelerating the adoption of AI in our daily workflows, companies are trying to create a dependency that they can leverage. When I realised this at the end of last year, I decided to start buying hardware that I could use to run local inference, in order to reduce my dependence on big token bills and on subscriptions with shrinking token allowances.

My journey started with a Strix Halo chip, the Ryzen AI Max+, which has become my daily driver and gives me up to 128GB of unified memory. This machine allows me to comfortably run Qwen3.6-27B and Gemma 4 locally for my LLM-powered background tasks. Think email and calendar digests, meeting summaries, TTS, etc.: the kind of assistant and automation work that doesn’t need a fast feedback loop or large contexts and can run continuously in the background. This keeps my AI bill from growing, and it avoids unnecessarily draining the token quota of my subscriptions, which I desperately need for more complex agentic tasks.

While this setup works fine for this kind of use case, it has proven quite annoying once you want to level up your game and let your agents rely exclusively on local models. The key problem is throughput. Even if the model fits in memory, as soon as you need to support an application that requires large contexts or tight feedback loops (agentic coding, auto-research tasks, real-time tool calls, or even running OpenClaw or Hermes agents), the tokens per second required to make the experience bearable (at least for me) aren’t there yet.

Fortunately, this gap is solvable, but today it may cost a few thousand dollars. So before spending a few “Ks” on hardware I wanted to be really sure and understand the setup that would give me what I need. This post is my public report of all my findings.

How inference actually works

But before we get into the hardware, it’s worth refreshing what “inference” actually requires, because the hardware characteristics that matter most, and how they impact your user experience, may not be the ones many people intuitively expect.

There are three main resources in play at inference: memory capacity (whether the model fits at all), memory bandwidth (how fast weights and caches stream into the compute units), and raw compute (how fast those units do the maths). Most people focus on the third one, while the bottleneck is almost always the second.

Here’s why. An LLM generates text one token at a time, autoregressively. Each token requires reading a large chunk of the model’s weights from memory into the processing units. The weights themselves don’t change (you’re not training, you’re reading), which means the question isn’t “how many FLOPS can this chip do?” but “how fast can it stream data from memory?” That is memory bandwidth, measured in GB/s, and it is what matters.

To give you some numbers that can help you build your intuition, an RTX 3070 with 8GB of VRAM has 448 GB/s of memory bandwidth. A newer RTX 4060 Ti with the same 8GB has 288 GB/s. So the 3070, which is older and cheaper, can be faster at inference as long as the model fits. This is counterintuitive until you understand what’s actually being measured. Apple understood it early, even if by accident: the unified memory architecture in M-series chips, where CPU, GPU, and Neural Engine share a single high-bandwidth pool with no bus crossings, turns out to be nearly optimal for exactly this kind of workload. This is what makes Apple devices with M chips so good at inference. I wrote about why a few weeks ago.
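To make the bandwidth argument concrete, here is a minimal back-of-the-envelope sketch. It assumes decoding is purely bandwidth-bound and that every generated token streams the full set of weights once, so it gives a ceiling, not a benchmark. The 4 GB model footprint is a hypothetical value chosen so the model fits in 8GB of VRAM; the bandwidth figures are the ones quoted above.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on decode speed if each token reads all the weights once.

    Ignores compute, KV-cache reads and kernel overheads, so real throughput
    will land below this number.
    """
    return bandwidth_gb_s / weights_gb


# Assumption: a quantised model whose weights take about 4 GB (hypothetical).
MODEL_GB = 4.0

# Memory bandwidth in GB/s, as quoted in the text.
cards = {"RTX 3070": 448.0, "RTX 4060 Ti": 288.0}

ceilings = {name: decode_ceiling_tok_s(bw, MODEL_GB) for name, bw in cards.items()}

for name, ceiling in ceilings.items():
    print(f"{name}: at most ~{ceiling:.0f} tok/s")
```

Under these assumptions the older card has the higher ceiling simply because it moves more bytes per second. The model size cancels out of the comparison as long as the model fits on both cards.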

The other bottleneck you need to understand is the KV cache. When a model processes a long conversation or code context, it caches the key and value vectors from each attention layer for every token it has seen, so it doesn’t have to recompute them. This cache grows with context length. At 200k tokens, it’s roughly 2GB with FlashAttention on, which is manageable. But without optimisation, long contexts can eat most of your VRAM before the model weights even load. Newer architectures like Qwen3.6 address this directly: only 10 of the model’s 40 layers use a full KV cache, meaning that going from 4k to 65k context adds roughly 800MB of VRAM rather than several gigabytes. Architecture decisions like this are why “how much VRAM does it need?” is a question that increasingly depends on which model you’re running, not just on how many parameters it has. If you want a deeper view of how transformers and KV caches work, I also shared a brief overview with external pointers in this post.
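A rough sketch of why the number of cached layers matters so much. The formula is the standard one for a plain KV cache: two vectors (key and value) per KV head, per cached layer, per token. The head count, head dimension and 16-bit storage below are hypothetical placeholders, not the real Qwen3.6 configuration, so the absolute sizes are illustrative only; the point is the ratio between caching 40 layers and caching 10.

```python
def kv_cache_bytes(n_tokens: int, n_kv_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Size of a plain KV cache: one key and one value vector
    per KV head, per cached layer, per token."""
    return 2 * n_kv_layers * n_kv_heads * head_dim * bytes_per_value * n_tokens


# Hypothetical attention shape (NOT taken from any model card).
N_KV_HEADS = 4
HEAD_DIM = 128

# Assumption: "4k to 65k" read as 4,096 -> 65,536 tokens.
extra_tokens = 65_536 - 4_096

full = kv_cache_bytes(extra_tokens, n_kv_layers=40,
                      n_kv_heads=N_KV_HEADS, head_dim=HEAD_DIM)
hybrid = kv_cache_bytes(extra_tokens, n_kv_layers=10,
                        n_kv_heads=N_KV_HEADS, head_dim=HEAD_DIM)

print(f"all 40 layers cached:   {full / 2**30:.2f} GiB extra")
print(f"10 of 40 layers cached: {hybrid / 2**30:.2f} GiB extra")
print(f"ratio: {full // hybrid}x")
```

Whatever the real head configuration is, the cache scales linearly with the number of layers that keep one, so caching a quarter of the layers costs a quarter of the memory for the same context growth.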

What does this mean for agentic work specifically? Tok/s matters more than it does for a chatbot. When an agent is executing a loop (calling a tool, parsing the output, deciding the next step), latency compounds. At 5 tok/s you’re waiting seconds between loop iterations. At 40 tok/s the loop feels instant. The difference between a useful coding agent and one you give up on is often that narrow. And this is the pain that I am feeling with my current setup. Around fifty tok/s is what I want to aim for with my next setup.
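To see how that latency compounds, here is a small sketch of the arithmetic. It only counts the time spent generating tokens and deliberately ignores prompt processing and tool execution time. The 40 tokens per step and the 20 steps are hypothetical values for a short tool-calling loop.

```python
def loop_seconds(n_steps: int, tokens_per_step: int, tok_s: float) -> float:
    """Time spent generating tokens across an agent loop.

    Ignores prompt processing and tool execution (simplifying assumption).
    """
    return n_steps * tokens_per_step / tok_s


# Hypothetical workload: 20 loop iterations, ~40 generated tokens each.
N_STEPS = 20
TOKENS_PER_STEP = 40

for speed in (5.0, 40.0):
    per_step = TOKENS_PER_STEP / speed
    total = loop_seconds(N_STEPS, TOKENS_PER_STEP, speed)
    print(f"{speed:>4.0f} tok/s: {per_step:.0f}s per step, {total:.0f}s for the whole loop")
```

With these made-up numbers, each step goes from a noticeable pause to about a second, and the gap widens with every extra iteration the agent needs.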

What the hardware market looks like

I’ve spent a long time in the weeds on this, and a lot of my thinking has been shaped by 0xSero’s detailed breakdown of the current market and all the experiments he keeps sharing publicly. If you don’t follow him already and you are interested in local inference, I highly recommend you do it right now. (And 0xSero, if you end up reading this, I can’t thank you enough for your contributions and all the good you’ve done for the open-source AI and local inference community.) Here’s how I’d summarise the options as of mid-2026, capped at roughly $10k for an end-to-end inference machine, building on 0xSero’s analysis and benchmarks and my own research.

Before I share the actual builds, here’s a summary table with the high-level hardware numbers from the previous section. As a reminder, memory capacity tells you which models fit, memory bandwidth tells you how fast they run. The table below puts those side by side so you can read the trade-offs against the metrics that actually matter.

With that framing, here’s the detail on each.

Source: 0xSero

Mac M3 Ultra

The cleanest option. Apple Silicon’s unified memory architecture (CPU, GPU, and Neural Engine sharing a single high-bandwidth memory pool) turns out to be nearly ideal for inference. No bus crossings, no transfer overhead. MLX has matured significantly in the last few months (as I described here) and is approaching the throughput of an Nvidia 3090 on comparable tasks. At 400W peak, the whole machine uses less power than a single overclocked 3090.

The biggest advantage is capacity: 512GB of usable memory means you can run Kimi-K2, DeepSeek, and MiniMax-M2 at full context, without extreme quantisation. Network two of them and you hit 1TB, something that would cost north of $50k with Nvidia. Scaling is quite clean in this case: each additional machine is its own self-contained unit with its own software stack, connected through Thunderbolt/Ethernet.

The key limitation here is the lack of CUDA support. A lot of tooling in the inference ecosystem (vLLM, SGLang, the training and fine-tuning stack) assumes CUDA. MLX is good and getting better, but its level of maturity is still not close to CUDA’s. If you also want to fine-tune or train on your inference box, this may not be the best solution. But for inference? It’s great!

8× Nvidia RTX 3090

This is the power-user option, and the one that requires the most assembly work. There is no pre-built version of this; you are building a workstation from parts. The shopping list looks something like this: a server-grade motherboard with at least eight PCIe slots (something like a Gigabyte MZ32-AR0 or Supermicro equivalent, $800–1,200), a server chassis or open-air mining frame ($200–400), a 2,000W+ PSU or dual PSU setup ($400–600), 256GB of DDR5 system RAM for MoE offloading ($400), and eight RTX 3090s at roughly $800–1,000 each, used. Total: $9–12k if you buy carefully, more if you don’t (which is always my case :) ). You will spend a weekend on this. Then another weekend on NVLink bridges and driver configuration.
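As a sanity check on the budget, the priced items in the shopping list can simply be summed. The sketch below covers only the parts that have a price in the text; anything not priced there (CPU, storage, risers, NVLink bridges, cooling) is left out, which is presumably part of why the quoted $9–12k total sits above the sum of the listed parts.

```python
# (low, high) prices in USD, taken from the shopping list in the text.
parts = {
    "motherboard": (800, 1200),
    "chassis or mining frame": (200, 400),
    "PSU": (400, 600),
    "256GB system RAM": (400, 400),
}
GPU_COUNT = 8
GPU_PRICE = (800, 1000)  # per used RTX 3090

low = sum(p[0] for p in parts.values()) + GPU_COUNT * GPU_PRICE[0]
high = sum(p[1] for p in parts.values()) + GPU_COUNT * GPU_PRICE[1]

# Share of the budget that goes to the GPUs alone, at the low-end prices.
gpu_share_low = GPU_COUNT * GPU_PRICE[0] / low

print(f"listed parts only: ${low:,} to ${high:,}")
print(f"GPUs are about {gpu_share_low:.0%} of the low-end total")
```

The cards dominate the bill, so the used 3090 price is the number to watch when budgeting this build.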

What do you get in exchange? 192GB of VRAM, with each card providing 936 GB/s of memory bandwidth, the fastest throughput on this list for dense models. Full CUDA support means vLLM, SGLang, and anything else the ecosystem has produced. You get a mature ecosystem and a box where you can also train and fine-tune.

The main downsides of this setup are power and noise: at full tilt the system draws 1,500W even with cards capped at a 50% power limit, and it will be quite noisy. The used 3090 market is also tightening, and scaling beyond 8 cards requires an electrician and a second system. Think of this as a serious workstation close to data-centre level, not a quiet office machine.

If you like hardware and building your own machines, this is a really fun project. But if you don’t have the time, this one is probably a pass for you, even if the economics per GB of VRAM add up.

Ryzen AI Max+ / Framework Desktop

This is the chip in my own Beelink machine. Framework sells a very similar desktop configuration with 128GB starting at around $3k, expandable in 128GB increments up to 384GB. Mine includes 128GB. You can buy it configured and it arrives ready to run, with no assembly or heavy work needed. The power draw is modest, it’s quiet, and the RAM expands by swapping sticks rather than adding cards. I’ve been running it non-stop for the last six months without noticing anything on my electricity bill.

The same chip, the Strix Halo, is what 0xSero describes as bringing the cost-per-GB-of-memory down “an absurd amount” relative to Nvidia. At 128GB you’re past the capacity of four 3090s for half the price and a tenth of the hassle. Simon Couch has a good post showing what day-to-day local agent workflows look like on this class of machine. The memory architecture is similar in principle to what Apple is doing (unified pool, high bandwidth, no bus penalty), which is exactly why it’s competitive on inference despite the software friction.
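The cost-per-GB claim is easy to sketch with the prices already mentioned in this post. This is a rough comparison under stated assumptions: the Strix Halo figure is the whole machine at around $3k for 128GB, while the 3090 figure counts only the used cards (24GB each at $800–1,000) and none of the motherboard, PSU or chassis they need, so it flatters the Nvidia side.

```python
def usd_per_gb(price_usd: float, memory_gb: float) -> float:
    """Price divided by memory capacity."""
    return price_usd / memory_gb


# Whole machine, price and capacity as quoted in the text.
strix_halo = usd_per_gb(3000, 128)

# Used RTX 3090, cards only (24GB each), low and high street prices from the text.
rtx3090_low = usd_per_gb(800, 24)
rtx3090_high = usd_per_gb(1000, 24)

# Capacity of four 3090s versus one 128GB machine.
four_3090s_gb = 4 * 24

print(f"Strix Halo box: ${strix_halo:.1f}/GB")
print(f"RTX 3090 cards: ${rtx3090_low:.1f} to ${rtx3090_high:.1f}/GB")
print(f"four 3090s: {four_3090s_gb}GB vs 128GB of unified memory")
```

Capacity per dollar is only half of the story: it says which models fit, not how fast they run, which is where memory bandwidth comes back in.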

The catch: ROCm instead of CUDA. AMD’s software stack has improved considerably, but it still requires more configuration than CUDA-based workflows,

[truncated for AI cost control]