2026-06-28 08:14 UTC · In-site rewrite · 7 min read · Updated: 2026-06-28 08:20 UTC

The Real Cost of Using AI in 2026

This article analyzes four ways to access large language model tokens: subscription, pay-per-token API, cloud GPU rental, and self-owned hardware. Using personal usage data, the author finds that the pay-per-token API is cheapest, subscriptions are heavily subsidized, cloud rental is most expensive, and self-hosting rarely pays off financially. However, owning local hardware provides privacy, independence, and insurance against future price increases.

Source: Hacker News AI · Author: adlrocha

Jun 28, 2026

A few weeks ago I wrote about the shift from GPU-poor to token-poor. Since that post, and the ones I wrote about my recent obsession with AI independence, a lot of people have asked me for advice about how they should access intelligence: “fine, but what should I actually do? Buy a subscription? Pay per token? Build a rig? Rent one?” I dodged the economics in the token-poor post, so let’s do them properly now.

The first thing I did before writing this post was pull my own token bill for the last 60 days (which have actually been slower than usual) to model my token consumption (side note: I built a really cool tool for this in nibble, my agent harness, which I can write about in a future post if anyone is interested).

91% of my token spend went to expensive models I cannot run at home (and never will, because they are huge and closed). But I already held a yearly subscription, so why not use it first? The open models I “could” (and let me keep the quotes for now) host myself, the Qwens, the DeepSeeks, the GLMs and the Kimis, cost me around $30 over two months. Just. Thirty. Dollars.

This gap is what actually motivated this exercise and this whole post. What if I didn’t have that Claude subscription? How much would my access to intelligence have cost me, and what are the alternatives?

The four ways to buy intelligence

There are exactly four ways (that I could come up with) to get tokens out of a large language model.

The first (and probably, the most widely used) is a subscription. You pay a flat monthly fee, and someone else runs the model. This is the ChatGPT Plus, Claude Pro, or Kimi/GLM coding plans. Simple, capped, predictable, and as we’ll see, heavily subsidised.

The second is pay-per-token via an API. No flat fee: you pay for exactly what you consume, priced per million tokens in and per million out. This is generally what you use when you need to power your application with AI, or when you route an agent through a serverless LLM provider like OpenRouter, Fireworks or Together. Feel free to correct me in the comments, but I would say people leveraging agents in their day-to-day prefer the predictability of subscriptions over paying per token.

The third is renting a GPU in the cloud. You spin up a machine by the hour, load whatever open-weight model you like, and serve it yourself. RunPod, Vast, Lambda. You’re not paying for tokens, you’re paying for raw compute time. With this, you don’t need to think about the number of tokens you are consuming anymore.

The fourth is owning the hardware. You buy the silicon, it sits in your house, and the marginal cost of a token is your electricity bill. As you know, this is the quest I’ve been on for a while now.

I really think each of these wins in a different regime. The tricky thing is knowing which regime you’re in.

The numbers, at real usage

For this exercise, I tried to be as objective and practical as possible. Let me put my own usage through all four and show the maths. My pattern is moderate and mixed: coding, writing, research, a few hours a day, heavy on context. I don’t have token-heavy long-running loops, and all my LLM-powered crons are routed through my local Qwen (not considered in this analysis). Of all my consumption, roughly 78 million input tokens and 13 million output tokens a year fall on the replaceable tier: the open models I could plausibly self-host. For personal and professional reasons, this month has been slower than usual in my use of tokens, but that gives me a good floor for my usage.

Here’s what that year costs, four ways:

Pay-per-token API: DeepSeek V4 Flash at $0.14 per million input tokens and $0.28 per million output tokens comes to about €13 a year. At my current, messier mix of open models, call it €130. Either way, low triple digits at most.

Cloud GPU rental: an MI300X with enough memory to hold a 100GB model runs about $1.99 an hour on RunPod. Realistically, we would need at least 200GB of VRAM to run something on the level of the open models I use through the API. At ninety hours a month that’s roughly €2,300 a year, plus storage so you don’t re-download the weights on every cold start.

Own hardware: a usable DIY server is around €2,900 up front for one GPU, and after that maybe €30 a year in electricity. I’ve been looking to build myself an AMD-based rig with at least 4 GPUs equivalent to an RTX 3060 or RTX 5090, and that takes you to around €12,500. A pair of DGX Sparks, the configuration people actually want, is €9,600. I also love the tinybox red v2, but that’s $12,000 for only 64GB of RAM (and I am not sure how upgradable it is, even with a lot of tinkering).

Subscription: whatever your flat fee is, for models the other three options can’t touch at this quality.
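To make the API and rental figures reproducible, here is a minimal Python sketch of the arithmetic. The token volumes and prices are the ones quoted above; the 1.10 USD-per-EUR exchange rate and the helper names are my own assumptions. The rental function covers compute for a single GPU only (no storage, no second card), so it lands below the €2,300 figure in the text rather than reproducing it.

```python
# Annual cost of the "replaceable tier" under two of the four options.
# Token volumes and prices come from the post; the exchange rate is an assumption.
USD_PER_EUR = 1.10  # assumed USD->EUR rate, adjust to taste

INPUT_TOKENS_PER_YEAR = 78_000_000
OUTPUT_TOKENS_PER_YEAR = 13_000_000
HOURS_PER_MONTH = 365 * 24 / 12  # ~730 hours


def api_cost_usd(input_tokens: int, output_tokens: int,
                 usd_per_m_in: float, usd_per_m_out: float) -> float:
    """Pay-per-token cost: prices are quoted per million tokens."""
    return (input_tokens / 1e6) * usd_per_m_in + (output_tokens / 1e6) * usd_per_m_out


def rental_cost_usd(usd_per_hour: float, hours_per_month: float, gpus: int = 1) -> float:
    """Cloud rental: you pay for GPU hours, not tokens (compute only, no storage)."""
    return usd_per_hour * hours_per_month * 12 * gpus


def utilisation(hours_per_month: float) -> float:
    """Fraction of the month the rented machine is actually in use."""
    return hours_per_month / HOURS_PER_MONTH


api = api_cost_usd(INPUT_TOKENS_PER_YEAR, OUTPUT_TOKENS_PER_YEAR, 0.14, 0.28)
rent = rental_cost_usd(1.99, 90)

print(f"API (DeepSeek V4 Flash): ${api:,.2f} ~ EUR {api / USD_PER_EUR:,.0f}")
print(f"Rental (1 GPU, 90 h/month): ${rent:,.2f} ~ EUR {rent / USD_PER_EUR:,.0f}")
print(f"Rental utilisation: {utilisation(90):.0%}")
```

Swap in your own token counts and per-million prices: the API line scales linearly with what you consume, while the rental line scales with the hours the machine is up, whether or not it is doing anything useful.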

Let’s look at those numbers for a second. The API is cheaper than everything else by up to two orders of magnitude. Cloud rental is the worst option on the board: at ninety hours a month you’re using the machine 12% of the time and paying as if you owned it, though it is true that you don’t have to pay much upfront. And the hardware saves you almost nothing against the API. I computed the break-even on a €2,900 rig versus low triple digits a year of API tokens, and it is measured in decades (not great).
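The break-even claim is easy to check. A minimal sketch, assuming the €2,900 single-GPU rig and €30 a year of electricity from above, compared against both API estimates (€13 and €130 a year). It deliberately ignores depreciation and the rig going obsolete, both of which would make owning look worse, not better.

```python
import math


def break_even_years(upfront_eur: float, api_eur_per_year: float,
                     electricity_eur_per_year: float) -> float:
    """Years until owning beats paying per token.

    Each year of ownership saves (API spend - electricity). If that saving is
    zero or negative, the rig never pays for itself on cost alone.
    """
    yearly_saving = api_eur_per_year - electricity_eur_per_year
    if yearly_saving <= 0:
        return math.inf
    return upfront_eur / yearly_saving


# Figures from the post: EUR 2,900 rig, ~EUR 30/year electricity.
for api_spend in (13, 130):
    years = break_even_years(2_900, api_spend, 30)
    print(f"API at EUR {api_spend}/year -> break-even in {years:.0f} years")
```

At €130 a year the rig saves €100 annually and takes 29 years to pay back; at €13 a year the electricity alone costs more than the API, so it never does.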

If your usage looks anything like mine, with bursts of high token consumption followed by calmer periods that I use to focus and think about things that do not require that many tokens, then the decision (at least today, and only looking at current 2026 numbers) is pretty straightforward: pay per token, keep the subscription for the smart stuff as long as it is subsidised, and don’t build anything yourself.

So why am I still thinking about building something myself?

Why subscriptions are the deal in the AI trap

First, let’s chat about something that everyone is talking about, but that I want to be explicit about: at today’s prices, the subscription is a gift, and you should take it (I don’t know how to make this bolder; I was tempted to highlight it and use a red font).

Paying for frontier models per output token costs real money. Claude Opus is $25 per million output tokens, Fable is (was) $50. If you actually metered a heavy month of frontier chat at API rates, it would dwarf a $20 or even a $200 subscription. We are being subsidised, and the size of it is startling once you read the analysis others have done. David Rosenthal, citing SemiAnalysis, shows how for $200 a month you can burn $8,000 in Anthropic tokens or $14,000 in OpenAI tokens. That’s a subsidy of 40 to 70 times what you pay. He calls it the drug-dealer’s algorithm: give the product away until the customer is hooked, then find the price later (you’ve probably read me say it: “we are not in an AI bubble but an AI trap”). OpenAI reportedly turned $13 billion of revenue into $34 billion of costs last year, so “later” is doing a lot of work. A trap built into a bubble (another of those cool sentences that people would assume are LLM-generated but that I came up with myself; this is me giving myself some kudos :) ).
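Here is the subsidy maths from the figures cited above as a quick sketch. The helper name is mine, and the numbers are only the ones already quoted (Rosenthal citing SemiAnalysis, Opus API pricing, OpenAI's reported revenue and costs), not new data.

```python
def subsidy_multiple(token_value_usd: float, subscription_usd: float) -> float:
    """How many times the subscription price you get back in API-priced tokens."""
    return token_value_usd / subscription_usd


SUBSCRIPTION_USD = 200  # monthly plan price from the post
TOKEN_VALUE_USD = {"Anthropic": 8_000, "OpenAI": 14_000}  # as cited by Rosenthal / SemiAnalysis

for lab, value in TOKEN_VALUE_USD.items():
    print(f"{lab}: {subsidy_multiple(value, SUBSCRIPTION_USD):.0f}x")

# For scale: at Claude Opus's $25 per million output tokens,
# the $200 fee alone buys this many output tokens at API rates.
opus_output_tokens = SUBSCRIPTION_USD / 25 * 1_000_000
print(f"$200 at Opus API rates: {opus_output_tokens:,.0f} output tokens")

# OpenAI's reported year: dollars of cost per dollar of revenue.
print(f"OpenAI cost per $1 of revenue: ${34 / 13:.2f}")
```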

But is building your own inference infrastructure the solution? Dylan Patel (founder of SemiAnalysis) thinks that local hosting will never be an option, as it needs far more silicon than shared infrastructure. A provider runs one GPU hot across thousands of users and amortises it across all of them. At home, you run one GPU at 10-30% utilisation. The subsidy you enjoy is partly just better economics that you physically cannot reproduce alone, and partly this “drug-dealer’s algorithm”. Big labs are not covering costs right now, and someone building infrastructure at home is competing with the bidding power of these deep pockets (which is what is happening to me). What the hell, even Apple had to raise prices this week due to (allegedly) the AI mania.
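To see why utilisation matters so much, here is a hypothetical sketch that amortises the same €2,900 of hardware over only the hours it actually does work. The three-year lifetime and the 90% provider utilisation are assumptions for illustration, not figures from Patel or anyone else; the 10-30% home range is the one quoted above.

```python
HOURS_PER_YEAR = 365 * 24


def cost_per_useful_hour(capex_eur: float, lifetime_years: float, utilisation: float) -> float:
    """Spread the hardware cost over the hours the GPU is actually busy."""
    useful_hours = lifetime_years * HOURS_PER_YEAR * utilisation
    return capex_eur / useful_hours


# Hypothetical: same EUR 2,900 box, assumed 3-year life, different utilisation.
home_low = cost_per_useful_hour(2_900, 3, 0.10)
home_high = cost_per_useful_hour(2_900, 3, 0.30)
provider = cost_per_useful_hour(2_900, 3, 0.90)  # assumed provider utilisation

print(f"home @10%: EUR {home_low:.2f}/h, home @30%: EUR {home_high:.2f}/h, "
      f"provider @90%: EUR {provider:.2f}/h")
print(f"home costs {home_high / provider:.0f}x to {home_low / provider:.0f}x more per useful hour")
```

Notice the gap doesn't depend on the hardware price at all: it is simply the ratio of utilisations, which is exactly the advantage a shared provider has over one person at home.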

So my immediate recommendation, all other concerns aside: take the subsidy while it lasts. It’s real money in your pocket today, and refusing it on principle is just leaving value on the table. I’ve been so afraid of rising subscription prices since last year that I’ve been hedging my access to intelligence by buying yearly subscriptions. They could change the quotas or the models, but at least I’d have access for one more year. And this is something I am still doing when it’s time to renew.

But I would also advise you to not build your life on the assumption these prices last. The cheap subscription is customer acquisition, not the steady state, and the history of every platform that ever ran below cost to win a market tells you what comes after the market is won. I really hope that I am wrong, and the economics of AI inference change, but you should either be prepared to pay more for your subscription, or find a way to hedge your access to intelligence at a reasonable price.

When owning actually makes sense

Here’s where I have to be honest with myself, because the maths clearly say that it is not a good idea to build an inference infrastructure yourself, but I keep wanting to build it anyway.

I realised owning hardware is not a cost decision, it is the AI version of “owning Bitcoin is not a good investment”. At normal usage it never pays back at current subscription and token prices, and anyone telling you it does is selling something or hasn’t run their own numbers. It is a value decision. You buy it for three things the cloud can’t sell you: privacy, independence, and tokens with no meter on them.

The good news is that the floor price for that has fallen hard, because the open models got genuinely good. A 128GB unified-memory box, a Ryzen AI Max+ machine or a high-end Mac, starts around €2,500 and runs models that were complete science fiction a year ago (but we’ve got too used to smart models). DeepSeek-V4-Flash-REAP-180B, a pruned and quantised version of a 641-billion-parameter model, fits in 97GB and runs on a single such box at 14 to 24 tokens a second. One person who switched to it described Qwen3.6-35B as great for small projects but falling apart past 100K context, then said of the DeepSeek model: “this thing is actually smart, coherent at long context, just works.” Unsloth recently squeezed GLM-5.1 by 85% so it can run on a 256GB Mac. And I recently came across Step-3.7-Flash that looks great.
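A rough way to sanity-check claims like “180B parameters in 97GB” is the usual back-of-the-envelope rule that weight memory ≈ parameters × bits per parameter ÷ 8. A minimal sketch, assuming GB means 10^9 bytes and ignoring KV cache, activations and runtime overhead (which is exactly what the leftover headroom on the box has to cover):

```python
def weights_gb(params_billion: float, bits_per_param: float) -> float:
    """Approximate weight memory in GB: params x bits / 8 bytes."""
    return params_billion * bits_per_param / 8


def implied_bits(params_billion: float, size_gb: float) -> float:
    """Average bits per parameter implied by a model's size in memory."""
    return size_gb * 8 / params_billion


# Figures from the post: a 180B-parameter pruned model in 97GB on a 128GB box.
bits = implied_bits(180, 97)
print(f"~{bits:.1f} bits per parameter on average")
print(f"at 16-bit the same model would need ~{weights_gb(180, 16):.0f} GB")
print(f"rough headroom on a 128GB box: ~{128 - 97} GB for KV cache, OS and runtime")
```

Roughly 4.3 bits per parameter is aggressive but typical of current quantisation; the same weights at 16-bit would need around 360GB, well beyond any desktop box.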

The open-source models that you can run on less than 10k of compute are still quite limited. But the open-source and local inference community is moving super fast, and less than five grand today (let’s see what happens to hardware prices a few months from now) buys you a private model on your desk that handles most of what isn’t frontier work. And I am not talking about a toy (“how cute, I can run an LLM at home”), but an actual working tool.

An additional disclaimer: I am looking at this as an individual. If I were doing the analysis for my own company, where I had the funds for proper hardware and the infrastructure would be shared by other people (or be the core of my product), then I would definitely lean towards local (or cloud) inference.

The real hedge: own some compute anyway

So here’s the practical conclusion of my exercise after running the math.

Owning a decent local box is insurance, not an investment. Insurance is supposed to look like a bad deal on the spreadsheet (like Bitcoin). You don’t buy it expecting a return, you buy it because the thing it protects against is catastrophic and you can’t fix it after the fact.

And the things this local box will protect you against are getting more real than a few months ago (I’ll let you call me paranoid in any case, many have done it already, including my wife). Models get censored or quietly lobotomised on whole categories of topic (right, Fable?). Providers deprecate the exact version your workflow depends on, and the replacement behaves differently. Or you simply lose access in your geography: export controls tighten, a provider decides the EU isn’t worth

[truncated for AI cost control]