Local Qwen isn't a worse Opus, it's a different tool
The author, a founder of a small software business, shares real-world experience with local models like Qwen. He argues that while local models lag behind frontier models on benchmarks, they offer unique value in privacy, fixed costs, and vendor risk avoidance. He also candidly discusses limitations like infinite loops and hallucinations, warning against using them for unsupervised long-horizon tasks.
We've all heard people say that local Qwen 27B or 35-A3B is "near-Opus level", but I have receipts from a software business and open source projects, and am here to be transparent with you.
This post is long-form for a reason. It's not a cursory glance, an unsubstantiated claim on X about cancelling Claude Max, or a hobbyist report from a model running at single-digit tokens per second with a 32K context window. It isn't written by a famous CEO tweeting about coding from an airplane.
It's my journey as a founder in a small software business, where local models have produced real, caveated value. I have skin in the game, but no incentive to push either cloud or local models, and a strong desire for local models to become capable and reliable.
I'll cover how the card paid for itself in the first two or three months, how it keeps serving our specific business use case, why I still can't trust it unsupervised, and Qwen's worst trait: the infinite loops and hallucination risk. These show up most when you quantize it down to fit a consumer GPU.
Figuring out the power connectors for the RTX 6000 Pro
On my use case for AI
My journey as a maintainer and founder started with OpenFaaS - built completely by hand, as all software was in 2016 and up until recently. That meant laying down the core of the project on my own, then inviting others to participate through community - not because I couldn't do it on my own, but because my goal was to build a successful open source project. Around 2017 I tried to fund my time by joining VMware, and in 2019, after changes in the market, I needed a way to fund the work myself, so I moved towards open-core and built a bootstrapped company. Today our small team maintains OpenFaaS; SlicerVM (AI sandboxes and "the missing API for Linux"); Actuated.com (self-hosted CI runners for GitHub/GitLab); and Inlets.com (self-hosted HTTP/TCP tunnels).
These products use very low-level Linux primitives like containers, Kubernetes, Firecracker microVMs, and networked protocols. If you squint, they're all opinionated infrastructure products focused on efficiency, user experience, control, and autonomy. They're written in Go, and some have React-based UI components, landing pages, docs, agent skills, and CLIs. Along with the code, we also provide best-in-class support, because we are lean and willing to do things that don't scale to help customers.
I've been using AI tools for as long as they've been available - from tab completion in VS Code in the early days, through to getting ChatGPT to generate chunks of code or find bugs, to living in tmux 12 hours per day. I found myself in tmux so much of the time that I wrote a free tool, Superterm.dev, to keep track of my sessions and notes, and to get visual feedback from coding agents. Over that time, I've seen the capabilities go from "reduce boilerplate" to "design, architect, and test end to end". It's Claude or Codex that does the majority of my work, and whilst I insist on doing my own writing, I rarely write code by hand - as much as it pains me to say that.
A turning point for frontier intelligence
I'd say it was roughly between November 2025 and January 2026 that we saw a turning point. Many developers on X started saying that Claude Opus had changed and was now capable of doing all of their work. Manual coding turned bad as quickly as milk sours when left out of the fridge. The cost of the top-end coding plans settled at roughly 200 USD / mo for individuals. A real number, but tolerable for the value they generated. Even today, if you avoid too much unattended work, you can make it last through the 5 hour limit, and the weekly limit if you're careful.
What makes local models interesting
There's an argument that says: "Why use anything less than the best you can afford?"
The year of 2026 certainly is a new frontier: we find ourselves in a place where any idea can be cloned overnight by someone you've never heard of, in a developing nation, with a subscription. I've seen it happen to our SlicerVM product (originally written by hand in 2022) and Superterm (new in 2026, 100% written by coding agents). That's not to say that a vibecoded clone is a 100% equivalent of a well engineered and architected solution with an experienced team supporting it, but in a market where the cost of software has gone to nil, free and good enough can be all that matters.
So in such a competitive landscape, why limit yourself to something that's worse? Isn't that an opportunity cost? Isn't that risking your livelihood?
There are estimates that the leading models contain between 0.5T and 2T parameters. That's not just "marginally more" or a "few times more" than the best in class for local hardware - that's on a different level. The parameter count is a rough proxy for capacity, knowledge, and reasoning ability. Yet somehow, even a tiny dense model like Qwen 3.6 27B is able to post a respectable 77.2% on SWE-Bench Verified vs 88.6% from Claude Opus 4.8.
So you could be forgiven for taking to X and shouting loudly that "local is only 12% behind SOTA", and many have, complete with engaging one-shot demos of Space Invaders. You may go as far as claiming that a single 6-year-old GPU can replace your 200 USD / mo ChatGPT Pro subscription, and plenty of people have made that claim too.
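The same two scores can be read in more than one way, which is part of why the "only 12% behind" framing travels so well. The sketch below uses only the two numbers quoted above; the function name and the three views it reports are illustrative choices, not an official way of comparing SWE-Bench results.

```python
# Scores quoted in the text (percent of SWE-Bench Verified tasks resolved).
LOCAL_SCORE = 77.2  # Qwen 3.6 27B
SOTA_SCORE = 88.6   # Claude Opus 4.8


def gap_summary(local: float, sota: float) -> dict:
    """Return three views of the same gap between two percentage scores."""
    local_unresolved = 100.0 - local
    sota_unresolved = 100.0 - sota
    return {
        # Absolute difference in percentage points.
        "points_behind": round(sota - local, 1),
        # Difference expressed relative to the SOTA score.
        "relative_pct_behind": round((sota - local) / sota * 100, 1),
        # How many times more tasks the local model leaves unresolved.
        "unresolved_ratio": round(local_unresolved / sota_unresolved, 1),
    }


if __name__ == "__main__":
    for name, value in gap_summary(LOCAL_SCORE, SOTA_SCORE).items():
        print(f"{name}: {value}")
```

Read as points or as a relative gap, the difference looks small. Read as unresolved tasks, the local model leaves roughly twice as many on the table, which is closer to how a gap feels when you are the one cleaning up afterwards.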
Benchmaxxing
Benchmarks are a moving target, and since they're widely available, it's possible to educate and tune a model to obtain a higher score than it would otherwise on these tests. The classic SWE-Bench Verified benchmark is based upon a set of Python issues across a number of Open Source projects. Python has threads and async, but most code you run into is single-threaded and synchronous. In contrast, we write distributed systems in Go, where channels, contexts, and structs span a large execution domain. A strong score on the former says little about how a model copes with the latter.
Cost
There's a very popular take that "local models aren't about cost", and that comes from a position of privilege. Individuals can use coding plans that provide high amounts of usage through a working day for 200 USD / mo. On that basis, you are getting SOTA level intelligence, the best chance of something working and being of quality, of finding that bug, or generating that landing page.
Coding plans are clearly subsidised: just look at what happened to GitHub Copilot plans. They started off by giving away 1500 requests for 39 USD / mo, and you could make that last a very long time for pennies. Something undisclosed changed at GitHub/Microsoft/Azure, they moved everyone over to token-based pricing, and the backlash was huge. The true cost had been hidden for so long that we'd become accustomed to it.
Now, if you're paying for tokens at API rates, the breaking point comes sooner than many of us realise. Recently, Uber capped spend to 1500 USD / mo per developer per tool. The median salary at Uber is 330k USD annually, so if a developer used two tools to the maximum extent, that's 36k USD per year, or roughly 11% of their annual compensation.
So for heavy use, loops, agentic analysis, and in-product capabilities deployed through SaaS systems, open-weight or local models can provide serious value. It's not fair to rule out cost, but for many it's not about that.
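The arithmetic behind the capped API spend is worth spelling out next to the flat-rate plan. This is a back-of-envelope sketch using only the figures quoted in this section; the constant and function names are made up for illustration, and it ignores tax, hardware, and electricity.

```python
# Figures quoted in this section.
CAP_PER_TOOL_USD_MONTH = 1500      # spend cap per developer per tool
MEDIAN_SALARY_USD_YEAR = 330_000   # median annual compensation
PLAN_USD_MONTH = 200               # top-end individual coding plan


def annual_spend(monthly_usd: float, tools: int = 1) -> float:
    """Yearly cost of spending monthly_usd on each of `tools` tools."""
    return monthly_usd * 12 * tools


def share_of_salary(annual_usd: float, salary_usd: float) -> float:
    """Annual spend as a percentage of salary, to one decimal place."""
    return round(annual_usd / salary_usd * 100, 1)


if __name__ == "__main__":
    capped = annual_spend(CAP_PER_TOOL_USD_MONTH, tools=2)
    plan = annual_spend(PLAN_USD_MONTH)
    print(f"two capped tools: {capped:.0f} USD/yr, "
          f"{share_of_salary(capped, MEDIAN_SALARY_USD_YEAR)}% of salary")
    print(f"one coding plan:  {plan:.0f} USD/yr, "
          f"{share_of_salary(plan, MEDIAN_SALARY_USD_YEAR)}% of salary")
    print(f"capped spend is {capped / plan:.0f}x the plan")
```

On these numbers, two fully used tools at the cap come to just under 11% of the median salary, and 15 times the yearly cost of a single 200 USD / mo plan.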
Sovereignty and privacy
We work with various enterprise customers that take data controls very seriously. If you squint at our product line, we're all about privacy and sovereignty. OpenFaaS runs functions on your infrastructure, with your limits and preferred languages, and events. SlicerVM runs microVMs not on some abstracted cloud-based bare-metal, but on your own kit, even your MacBook. Inlets runs tunnels where you can control the tunnel client and server with 100% privacy. Actuated takes the arduous parts of GitHub Actions away and says "install an agent on your machines and forget about it".
So naturally, we are drawn to local models - both through our core values and beliefs about how the Internet should be, and through our obligations to customers.
You may not hold these beliefs, you may not handle any customer data, but if you live outside of the US, the removal of Anthropic's Fable 5 model overnight might have come as a shock. In other words, there is serious vendor risk, and many of us are addicted to the source.
Local models are the solution to "What if the frontier labs do X?"
Tempering the blade
I said that local models are not the same tool as SOTA. What did I mean by that?
I build furniture using hand tools, and occasionally, just as I'll release an open source project to scratch an itch, I'll make an edge tool like a chisel, a grooving plane blade, a scratch awl, or a Sloyd knife for carving.
Tempering a Japanese style marking knife on the back of a heated file, until it hits straw colour.
There are two ways to work with steel depending on how much you can invest. Forging is taking a raw piece of steel, heating it up and smashing it with a hammer into the form you need. It's seen as the most pure and honourable way to work - the "real way". Then for smaller items, "stock removal" is much more approachable. It involves taking sheet steel, cutting out a shape and grinding in a bevel or a point.
But that's just the shaping. You then have to heat the steel up, and quench it in oil or water. This makes the steel extremely hard, so hard that if you dropped it, it would shatter into pieces. So we have to scrub off the black scum, and heat it up again, watching for a rainbow of colours. If we go one shade past where we need, we have to start the heat treating all over again.
Our team's experience of local models is exactly like missing the temper colours. The model is running so hot that it shoots past the goal and starts looping. Nothing can fix it, other than closing down the harness and hoping the cleared context will give a different result.
I'd never leave a blade tempering unattended, just like I'd never leave Qwen 3.6 27B working unattended on a long-horizon task. For steel the workaround is using a kiln, or a temperature-controlled oven, to remove variability.
That Sloyd knife we forged could be used to knock in nails, but you're likely to cut your hands and ruin the edge at the same time. Let's go back to the start: if it's a different tool, what is it good for?
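For an agent, the nearest thing to watching the temper colours is noticing when it keeps repeating the same action. The sketch below is a hypothetical watchdog, not a feature of opencode or any other harness: the class name, the window size, and the repeat threshold are all assumptions made for illustration. It does not cure the loop, it only decides when to stop and clear the context.

```python
from collections import deque


class LoopWatchdog:
    """Flags an agent that keeps repeating the same action.

    Hypothetical helper for illustration; not part of any real harness.
    """

    def __init__(self, window: int = 20, max_repeats: int = 3) -> None:
        # Sliding window holding only the most recent actions.
        self.recent = deque(maxlen=window)
        self.max_repeats = max_repeats

    def observe(self, action: str) -> bool:
        """Record one action; return True once it has repeated too often."""
        # Collapse whitespace and case so trivial variations still match.
        normalized = " ".join(action.split()).lower()
        self.recent.append(normalized)
        return self.recent.count(normalized) >= self.max_repeats


if __name__ == "__main__":
    watchdog = LoopWatchdog(window=10, max_repeats=3)
    # Made-up tool calls standing in for an agent transcript.
    actions = ["read main.go", "read go.mod", "read main.go", "Read  main.go"]
    for step, action in enumerate(actions, start=1):
        if watchdog.observe(action):
            print(f"step {step}: possible loop on {action!r}, stop and reset")
            break
```

Exact-match counting is deliberately crude. It catches the case where a model re-reads the same file or re-runs the same command, but not a loop where every iteration is worded slightly differently, so it is a supervision aid and no substitute for staying at the bench.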
What I was looking for
I was looking for all of the things we covered in the previous section: privacy, fixed costs, and protection against vendor risk. Where I got let down, and continue to get let down, is where I treat a local model inside opencode in the same way I treat Claude or Codex. It's almost creepy how long those two can work fully unattended whilst making real progress towards a goal.
I can paste in something like: "Eoin told me he has been running Slicer VMs in a loop and ran out of FDs. He suspects VSock" and then after a couple of minutes Claude replies "Now I see the full picture: You're doing X, you need to do Y". I say "do it and test it end to end on my mini PC" and after any period of time - 5 or 15 minutes, I can raise a PR, have it code reviewed automatically, and then tell Claude to read it and iterate again.
It's a wonderfully efficient loop for a small team like us that manages multiple products and works very closely with enterprise and community users.
Sharp lessons from a 3090
I started off with a single 3090 card in 2023, and quickly realised I needed another to be able to load models and have sufficient context. Nothing about local models from 2023 is worth covering here, other than that they were so hard to use that I gave up on them. Qwen 3.5 was the first time I saw real work being done by agents.
I could load a model into either card in Q4 quantization with 200k context (also quantized) and get it to do small tasks, when guided. I still remember how quickly that went south. I told the model "Explore this machine from every angle, complete a forensic report on the machine and how it's used" - Claude would have shrugged that off. Qwen started reading every single file on my machine one by on