The frontier is open-source today
GLM-5.2 outperformed Opus 4.8 on an AI-resistant take-home test, leading to the release of offmute-v2, an open-source transcription pipeline that fuses STT with multimodal LLMs. The article details the experiment, comparison, and caveats.
GLM-5.2... was able to single-shot our backend take-home to a higher quality than Opus 4.8. This was a take-home designed to be AI-resistant, and while Opus and GLM turned in reasonable results, the GLM version worked out of the box, had better transcriptions and speaker identifications, followed instructions more closely, and produced more maintainable code.
Thanks to both of these models, I'm happy to release offmute-v2 today - something I've wanted forever. It combines the learnings from offmute (our most daily-driven tool), meeting-diary, ipgu and other projects into a multi-step pipeline that fuses a regular STT model with a multimodal LLM, which gives us timestamp-correct, diarized transcripts with identified speakers. It runs anywhere, even in the browser (so I can integrate it into things), and it's extensible and easy to plug providers into. It's also instructable, so I can fix common misspellings or ask the models to zoom in on a conversation in a crowded room.
offmute-v2 is awesome: it's more accurate, better formatted, and cheaper than offmute. Here's a non-vibe analysis and comparison - with no detail glossed over.
By way of receipts, offmute-v2 has two versions - offmute-v2@glm and offmute-v2@opus. I'll combine and update offmute-v2@latest to be the one I'm daily-driving and improving (which is the glm version).
The open-source repo has two branches for glm and opus, both with my manual review, the LLM progress logs, and all the receipts for the analysis here. If you enjoy reading (a dying skill today) - this is the place to be.
It's not all roses, though. I'll cover some caveats later on - but it's hard to overstate how much of a watershed moment this is.
Both agents going at it - GLM on the left, Opus on the right, both in Claude Code.
Benchmarks have been useless for quite some time now, not just because contamination, Goodhart's law and number-maxxing have made them untrustworthy. Even when the results are good, the strange nature of intelligence makes it hard to understand what a 10 percent improvement actually means. It definitely does not mean 10 percent better.
Let's dive in.
The Task
It's now thoroughly cooked as a take-home, so I can divulge all the things we were looking for.
There are three repos - all functional, all tested and working, all in the same language - that contain working, base functionality you need to combine.
offmute implements speaker-labelled transcription using multimodal LLMs. It's a little old but it's something I use every day. For properly diarizing speech (through interruptions) and capturing tone and visual information (on videos) to save meetings, I believe it's still state-of-the-art.
Two problems: it doesn't do proper timestamps (LLMs are really bad at this), and the chunking means that sometimes speakers get mixed around (or the formatting breaks).
meeting-diary is mostly an AssemblyAI (and others) wrapper that uses their transcription and diarization model, then asks you to identify who each speaker is and retcons that into the transcript. Solid timestamp alignment, but it's manual, and offmute has it beat on transcription quality.
ipgu is a subtitle translation system that can take in an English subtitle, a full video file, and use iterative matching to get proper translations direct from video, align those to the subtitles, and produce a combined version. It's really great, but it needs that base subtitle for timestamping.
offmute (left) and ipgu (right) - two of the three projects offmute-v2 is built on.
Do you see where I'm going with this? The strengths of each cover the weaknesses of the others - and they've all had time to mature and stabilize, so the techniques are robust.
We can build a better version.
Except - it blows up if you try to one-shot it. This is intentional.
With the sheer volume of applicants we get sometimes, a good take-home needs to have a few properties. For one, it should be of a good difficulty, without being too hard or too arduous - ideally it should be fun.
For another, given the rise of agentic coding, it needs to reward proper use of AI and punish improper use. The ideal take-home or project is one that:
can be done without AI,
but is significantly accelerated with appropriate agent use,
and is massively hindered if you use agents wrong.
I think we succeeded with this one - even though we knew any and all success would be short-lived.
If you vibe-code this project, a number of strange smells show up in the code that are incredibly easy to catch:
The prompts (for diarization, speaker labelling, alignment, etc.) get vibed in by LLMs instead of being transported and applied from the source repositories. Prompting for these tasks is non-trivial - which means you'll consistently get broken pipelines, "looks right but smells wrong" transcription outputs, wrong speakers, etc.
The key lesson from ipgu is that structured data and formats are crucial to this whole endeavour. You need to figure out how to reliably extract speakers, timings, and other information from models, and match them word-for-word appropriately. Agents will vibe the whole thing.
Processing audio is difficult to do well, and in a cross-platform way. Processing video on top of that (like offmute does) is even harder. Agents left unchecked will install a whole bunch of dependencies, change their minds, leave them in, until your entire pipeline looks like a giant mess of ffmpeg, wasm, in-memory processing, fluent-ffmpeg, etc.
Most importantly, if you vibe this take-home, agents (anything Opus-level) will give you a project that looks right but will fail if you give it anything complex - which you will then submit, and disqualify yourself. If you vibed the whole thing, your agents probably downloaded some simple audio snippets (barely past 10 minutes), ran them through, and called success.
This is the same instinct behind Mitchell Hashimoto seeding his AGENTS.md and code comments with prompt injections... to catch contributors who sling unreviewed AI output across a human boundary - the take-home's smells catch the same thing.
This task has served us well for a long time now, both as a take-home and as an internal benchmark on how LLMs perform. For the benchmarking, we have a more extended prompt (that Opus 4.5 still fails) because we're testing models, not humans + models, internally. The test is how far and how well the model (and harness) can execute, so we remove almost all of the human smell-tests from the prompt.
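To make the structured-data point from the list of smells concrete, here is a minimal sketch of what "reliably extract speakers, timings, and other information" can look like: parse the model's reply into a fixed shape and fail loudly instead of guessing. It is written in Python for brevity, and the `Segment` shape, the field names, and `parse_segments` are assumptions for illustration, not the format used by ipgu or by either offmute-v2 build.

```python
import json
from dataclasses import dataclass


@dataclass
class Segment:
    speaker: str  # hypothetical field: a label such as "Speaker 1" or a name
    start: float  # hypothetical field: seconds, approximate when LLM-produced
    text: str


def parse_segments(raw: str) -> list[Segment]:
    """Parse a model reply expected to be a JSON array of segment objects.

    Raises ValueError on anything malformed rather than silently repairing it.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model reply is not valid JSON: {exc}") from exc
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of segments")

    segments = []
    for i, item in enumerate(data):
        if not isinstance(item, dict):
            raise ValueError(f"segment {i} is not an object")
        speaker = item.get("speaker")
        start = item.get("start")
        text = item.get("text")
        if not isinstance(speaker, str) or not speaker.strip():
            raise ValueError(f"segment {i}: missing speaker")
        # bool is a subclass of int in Python, so reject it explicitly
        if isinstance(start, bool) or not isinstance(start, (int, float)) or start < 0:
            raise ValueError(f"segment {i}: bad start time {start!r}")
        if not isinstance(text, str) or not text.strip():
            raise ValueError(f"segment {i}: empty text")
        segments.append(Segment(speaker.strip(), float(start), text.strip()))
    return segments
```

A pipeline step built this way either hands the next step well-formed segments or stops with a message that says which segment was wrong, which is far easier to review than output that looks right but smells wrong.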
Costs
Just leaving this here:
Total tokens per model (stacked bars: cache read, cache write, output, input):
Opus 4.8 (claude-opus-4-8): 286.6M
GLM-5.2 (glm-5.2[1m]): 209.0M
Almost all of it is cache reads - that’s what hours of agentic work look like. GLM actually moved fewer tokens overall.
Experiment Setup
The template folder - prompt, inspirations, and test files - cloned for each model. Everything else is empty.
The setup is pretty simple. We provide a folder (cloned for both opus and glm) with a single prompt file and two recordings to build and test against - a talk I gave (noisy) and the No Priors podcast with Satya Nadella... (tons of speakers) - plus some hand-checked transcripts to measure against (decent, but not ground truth). The rest of the folders are empty.
Both GLM and Opus are run in Claude Code - the best performing harness for GLM being Claude Code... - which should eliminate any harness-level variance. The starting instruction is the prompt file - nothing else.
Once both builds were done, I tested them on recordings they'd never seen - chiefly a 30-minute SuperAI panel with a roomful of speakers, plus a handful of other meetings - to see how they held up blind. That's the file you'll see in the transcripts just below.
Results
If you want to hear it from them, here are the process logs from Opus and GLM:
process log - Opus
process log - GLM
(If you're wondering why there's so much mention of NTU in those logs: one of the two test recordings is a talk I gave recently - "How not to leave Greenfield", at SQ Collective in Singapore.)
On first use, the Opus and GLM versions each had a single noticeable bug. The Opus version failed on audio files (it expected video and keyframes), and the GLM version would default to an intermediates directory with mis-caching, which meant that if you didn't specify your own, you would get the last cached output for any transcription. Both pretty bad, both easy enough to fix. The GLM output as a whole looked and functioned better.
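For the caching bug, one common way to rule out "last cached output for any transcription" is to key the intermediates directory on the input itself. The sketch below is in Python for brevity and is not the fix applied in the repo; `cache_key`, `intermediates_dir`, and the `options` dictionary are hypothetical names.

```python
import hashlib
import json
from pathlib import Path


def cache_key(input_path: Path, options: dict) -> str:
    """Derive a cache key from the input file's bytes and the run options."""
    digest = hashlib.sha256()
    with input_path.open("rb") as f:
        # Read in 1 MiB blocks so long recordings are not loaded into memory
        for block in iter(lambda: f.read(1 << 20), b""):
            digest.update(block)
    # sort_keys makes the key independent of dictionary insertion order
    digest.update(json.dumps(options, sort_keys=True).encode("utf-8"))
    return digest.hexdigest()[:16]


def intermediates_dir(root: Path, input_path: Path, options: dict) -> Path:
    """Return a per-input directory under root, creating it if needed."""
    path = root / cache_key(input_path, options)
    path.mkdir(parents=True, exist_ok=True)
    return path
```

With a key like this, two different recordings can share the default root without ever reading each other's intermediates, and changing an option (a different model, say) also produces a fresh directory.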
How it works
Both builds work, and both landed on the same core idea: let a multimodal LLM own the content (verbatim text, tone, who's speaking, even through interruptions), let an ASR model own the clock (word-accurate timestamps), and fuse them with a single token-alignment pass.
The differences are all in the seams: which bugs showed up on first contact, how readable and conventional the code is, and how real the "runs in the browser" claim is.
The crux: lining up two transcripts
The hard part - the thing the whole tool rests on - is the alignment. The LLM gives you beautiful diarized text with tone, but its timestamps drift by minutes. The ASR gives you word-perfect timings but messy, speaker-blind text (it kept the "um like you know" the LLM cleaned out). So you line the two word streams up against each other and read the ASR's clock onto the LLM's words.
The fun part: both models, working alone, reached for the exact same tool - a global Needleman-Wunsch alignment... - the classic dynamic-programming way to match two sequences - run over the two token streams, O(n·m), one matrix per chunk, with ties broken toward exact matches so repeated words don't scatter to the wrong occurrence. Down to the cost weights, the two implementations are siblings.
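As a rough illustration of that idea, here is a minimal sketch of a global Needleman-Wunsch pass that reads the ASR clock onto the LLM's words. It is written in Python for brevity (the builds themselves are TypeScript), and the cost weights, `AsrWord`, `norm`, `align`, and `transfer_times` are assumptions for illustration, not code from either branch.

```python
from dataclasses import dataclass
from typing import Optional

MATCH, MISMATCH, GAP = 2, -1, -1  # assumed weights, not taken from either build


@dataclass
class AsrWord:
    text: str
    start: float  # seconds


def norm(token: str) -> str:
    """Lowercase and drop punctuation so "Right," can match "right"."""
    return "".join(ch for ch in token.lower() if ch.isalnum())


def align(llm: list[str], asr: list[AsrWord]) -> list[tuple[int, Optional[int]]]:
    """Global Needleman-Wunsch over two token streams, O(n*m) time and memory.

    Returns (llm_index, asr_index) pairs in order; asr_index is None when an
    LLM token has no ASR counterpart.
    """
    a = [norm(t) for t in llm]
    b = [norm(w.text) for w in asr]
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * GAP
    for j in range(1, m + 1):
        score[0][j] = j * GAP
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = MATCH if a[i - 1] == b[j - 1] else MISMATCH
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # pair the two tokens
                score[i - 1][j] + GAP,      # LLM token with no ASR word
                score[i][j - 1] + GAP,      # ASR word the LLM dropped (e.g. "um")
            )

    # Trace back from the bottom-right corner to recover the pairing
    pairs = []
    i, j = n, m
    while i > 0:
        here = score[i][j]
        if j > 0 and a[i - 1] == b[j - 1] and here == score[i - 1][j - 1] + MATCH:
            pairs.append((i - 1, j - 1))  # ties go to the exact match first
            i, j = i - 1, j - 1
        elif j > 0 and here == score[i][j - 1] + GAP:
            j -= 1
        elif here == score[i - 1][j] + GAP:
            pairs.append((i - 1, None))
            i -= 1
        else:  # substitution: the streams disagree on the word, keep the timing
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
    pairs.reverse()
    return pairs


def transfer_times(llm: list[str], asr: list[AsrWord]) -> list[Optional[float]]:
    """Give each LLM token the start time of its aligned ASR word, if any."""
    times: list[Optional[float]] = [None] * len(llm)
    for llm_index, asr_index in align(llm, asr):
        if asr_index is not None:
            times[llm_index] = asr[asr_index].start
    return times


asr_words = "so um like you know help me get it right".split()
asr = [AsrWord(w, 0.5 * k) for k, w in enumerate(asr_words)]
llm = "So, help me get it right.".split()
print(transfer_times(llm, asr))  # [0.0, 2.5, 3.0, 3.5, 4.0, 4.5]
```

The fillers the LLM cleaned out become gaps on the ASR side, so "help" picks up the timestamp of the sixth ASR word rather than the second. Checking the exact-match branch first during traceback is one simple way to break ties toward exact matches; the real implementations may do it differently.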
Here's an interactive demo of how the algorithm matches words from two different transcripts, with and without the optimizations.
[Interactive demo: the DP matrix for the LLM tokens "so help me get it right" (rows) against the ASR tokens "so um like you know help me get it right" (columns), with an optional band of width ±1.]
With the band at ±1: cells computed: 23 / 77 · memory: 77 cells (full matrix) · score: -1 (full DP scores 8)
⚠ The optimal path leaves the band - banding returns a different, worse alignment here. Widen the band (or turn it off) to recover the true path.
With one tell. Opus's align.ts docstring claimed it used a "Hirschberg-free banded variant" to bound cost on large inputs - except there's no banding anywhere in the code; it's plain full-matrix DP. The optimization had been considered and written into the comment, but never into the code. Its own automated review caught it ("there is no banding"), I flagged it on the read, and it got rewritten to describe what's actually there. The code was honest; the comment wasn't - exactly the kind of thing you only catch by reading. (Turn banding on with a narrow width in the demo above and you can watch the true path slip outside the band - that's the bug the comment would have shipped.)
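The banding failure mode is easy to see in a few lines. The sketch below, in Python for brevity, scores the same two token streams with full-matrix DP and with a band around the length-scaled diagonal. The weights and the band shape are assumptions (the demo's are not stated here), so this is an illustration of the effect rather than a reproduction of the demo, and `nw_score` is a hypothetical helper.

```python
MATCH, MISMATCH, GAP = 2, -1, -1  # assumed weights
NEG_INF = float("-inf")


def nw_score(a, b, band=None):
    """Needleman-Wunsch score of two token lists.

    With band=None every cell is computed. Otherwise only cells within
    `band` columns of the length-scaled diagonal are computed; the rest
    stay at -inf, so paths through them are impossible.
    """
    n, m = len(a), len(b)

    def in_band(i, j):
        if band is None or n == 0:
            return True
        center = (i * m + n // 2) // n  # row i's diagonal column, rounded
        return abs(j - center) <= band

    score = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    score[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if (i == 0 and j == 0) or not in_band(i, j):
                continue
            best = NEG_INF
            if i > 0 and j > 0:
                sub = MATCH if a[i - 1] == b[j - 1] else MISMATCH
                best = max(best, score[i - 1][j - 1] + sub)
            if i > 0:
                best = max(best, score[i - 1][j] + GAP)
            if j > 0:
                best = max(best, score[i][j - 1] + GAP)
            score[i][j] = best
    return score[n][m]


llm = "so help me get it right".split()
asr = "so um like you know help me get it right".split()

full = nw_score(llm, asr)            # six matches, four gaps
narrow = nw_score(llm, asr, band=1)  # the true path does not fit in the band
wide = nw_score(llm, asr, band=4)    # wide enough to contain the true path
print(full, narrow, wide)
```

The four fillers sit in one run right after "so", so the true path has to travel several columns along a single row, further than a narrow band allows. The banded run still returns a score and an alignment, just a worse one, which is why a comment that promises banding the code doesn't have (or code that bands too tightly) is worth catching on the read.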
The READMEs, side by side
Even the way each one introduces its own work tells you something. Here they are next to each other - GLM's README on the left, Opus's README on the right:
GLM (left) is terse and conventional - a results table, an architecture tree, plain usage. Opus (right) is warmer and more marketed - emoji in the title and a "Why it's accurate" WHEN/WHO/HOW table.
The transcripts, side by side
Here are both builds on the same stretch of that unseen panel - the opening round, where the moderator asks the room to raise hands and the panelists introduce themselves. The brief said the one thing a block can't be is too thick to read in one go - and that's where they diverge.
The panel in question is the SuperAI "100x Company" panel:
The SuperAI "100x Company" panel - five speakers on stage, an unseen 30-minute file.
GLM
[truncated for AI cost control]