MiniMax M3 vs. GLM 5.2: Codegen comparison across autonomous coding tasks
In the Thinkbench benchmark, GLM 5.2 led in correctness (92% full-pass) while MiniMax M3 was cheaper and faster. They performed similarly on code modification tasks, but GLM was steadier on greenfield builds. MiniMax tended to build more complete systems with ambiguous prompts.
DATARESULTS DATAEVAL SUITE
Thinkbench, our custom evaluation harness, was used to drive both models through the same autonomous coding loop: read files, write files, run shell commands, and stop when the task was complete. The scored suite covered greenfield builds, bug fixes, feature additions, and repair-to-green tasks. Hidden graders ran after each model stopped, using fixed-denominator behavior checks. We also included a separate ungraded dimension covering how the models handle ambiguously defined instructions, with briefs for systems such as audit logs, schedulers, feature flags, and notification hubs. For those tasks, we tracked implementation choices, API shape, scope control, and failure semantics.
72total tasks
60hidden-graded
12observed only
432final rows
- Results
GLM was steadier. MiniMax was cheaper and faster.
Across the 60 scored tasks, GLM 5.2 finished with the stronger correctness profile: 92% full-pass and a 0.976 mean score. MiniMax M3 finished at 84% full-pass and a 0.961 mean score. Because the separation was modest, cost and latency will probably influence which model makes sense. MiniMax cost $6.67 for the scored runs against GLM's $18.47, and averaged 45 seconds per run against GLM's 80 seconds.
Overall scored result
60 tasks, 3 trials per model; costs are cache-aware final scored rows.
Reading it: GLM has the better correctness numbers; MiniMax has the lower operating cost and lower latency.
Overall scored rows
ModelFull-passMean scoreAvg latencyAvg tokensTotal cost
GLM 5.2165 / 180 (92%)0.97680s82,443$18.47
MiniMax M3152 / 180 (84%)0.96145s135,060$6.67
Both models could code. What seemed to matter more was where they started to need review: package shape, edge cases, API design, or judgment calls in a loose brief. On existing-code work, the models were almost indistinguishable: bug fixes, feature additions, and repair-to-green tasks all landed at 0.999 to 1.000 mean score. The hard part was building from an empty repo.
Full-pass rate by task type
The greenfield lane is where the benchmark actually separated them.
Reading it: GLM's edge is concentrated in implement tasks. Both models effectively saturated the modification tasks.
- Separation
Differences were concentrated in greenfield builds.
Per task, 54 of 60 scored tasks landed within 0.1 mean score of each other. Greenfield builds accounted for all six larger gaps. It means the benchmark did not show one model broadly failing and the other broadly succeeding. It showed a narrow difference in from-scratch packaging, API design, and edge-case discipline.
Largest task gaps
Mean score over three trials; only the largest gaps are shown.
Reading it: GLM's biggest win was ticketflow; MiniMax's clearest wins were patchwise and migrato.
Where GLM looked stronger
ticketflow was the largest gap: GLM went 1.00 while MiniMax averaged 0.33. The issue was not that MiniMax could not reason through the assignment. Two trials used a package layout the grader could not import from the workspace root, so the hidden checks never reached the logic. GLM delivered the package shape consistently.
Where MiniMax pushed back
patchwise went the other way: MiniMax scored 1.00 while GLM averaged 0.62. GLM delivered code, but its failures were real implementation bugs: a name typo in one trial and trailing-newline diff handling in two others. MiniMax handled that fixture cleanly.
microapi was the second clear GLM win. MiniMax built a plausible framework, but missed a short-circuit middleware return in one variant and produced a bad route regex in another. migrato was a MiniMax win: it cleared the migration-contract checks where GLM lost points. The pattern I take from this: GLM was steadier at packaging and complete delivery; MiniMax could still beat it on individual hard builds.
- Ambiguity
When tasks were vague, MiniMax filled in more system.
The observed tasks were different. They were deliberately vague, and I didn't score them because there was no single correct implementation. The point was to compare what each model decided to build when the prompt left room. In that phase, MiniMax M3 consistently added more production-shaped machinery. GLM 5.2 more often stayed closer to the plain reading of the brief.
auditlog
MiniMax added hash-chain verification, a query builder, an action decorator, and file permission hardening. GLM stayed flatter: hash chain, boolean verification, and direct query filters.
notifyhub
MiniMax built a notifier with priority fallback and a hard failure when no channel worked. GLM collected per-channel send results and returned a report.
formvalidate
MiniMax built conditional validation with a when(...) combinator. GLM explicitly declined that extra scope and kept the validator simpler.
scheduler
MiniMax chose a fixed one-second background tick. GLM used a heap and slept until the next scheduled event, which was more idle-efficient.
I'm comfortable calling MiniMax the more eager model in this set because that claim is backed by the artifacts, not by vibe. It repeatedly reached for locks, persistence, policy objects, fallback paths, decorators, and extensible strategy shapes. That can be useful. It can also be too much. GLM's restraint is not automatically better either; sometimes it missed a useful abstraction that MiniMax supplied.
Observed tasks are evidence, not scores
TaskOpen decisionMiniMax M3 tendencyGLM 5.2 tendency
jobflowDependency modelObject references, DFS cycle detection, locks, CLIString references, Kahn topo-sort, bare module entry
ratelimitAlgorithm extrasToken bucket plus variable request cost and resetToken bucket with one-token consumption
featureflagsTargeting architectureNested strategy hierarchyInline rules inside one evaluator
docsearchIndexing shapeTF-IDF with disk-backed JSON CLITF-IDF with lazy posting-list cache
- Method
How the benchmark was run.
Each model got the same task brief, and for non-greenfield tasks the same starter code. It worked through a read/write/run loop, then stopped. Only after that did the hidden grader get copied into the workspace. A scored run used a fixed denominator, so an unimportable package scored zero instead of quietly shrinking the number of checks.
Same workBoth models saw the same brief and starter files for each task.
Three trialsEvery scored task ran three times per model, with thinking disabled.
Hidden graderThe grader was copied in only after the model stopped editing.
Ambiguous specVague briefs were preserved and read, not forced into pass/fail scoring.
Client configuration
ParameterMiniMax M3GLM 5.2
ProviderFireworks AIFireworks AI
Serving pathServerless endpointServerless endpoint
API shapeOpenAI-compatible chat completionsOpenAI-compatible chat completions
Endpointhttps://api.fireworks.ai/inference/v1https://api.fireworks.ai/inference/v1
Model IDaccounts/fireworks/models/minimax-m3accounts/fireworks/models/glm-5p2
Service tierprioritypriority
Thinking modenonenone
Trials per task33
Input price$0.45 / Mtok$2.10 / Mtok
Cached input price$0.09 / Mtok$0.39 / Mtok
Output price$1.80 / Mtok$6.60 / Mtok
Cached input share73.36%73.06%
Avg scored run latency45.0s79.6s
Latency is end-to-end scored-run wall-clock time, not provider network timing.
Those client settings matter for interpreting the cost numbers. Cost was cache-aware because autonomous coding loops resend conversation context; cached input was priced separately from uncached input and output. The total production spend for building the benchmark and results was $48.88 across 605 metered runs. The scored table above is narrower: it reports only the final scored rows.
- Assessment
Where I'd use each model.
GLM 5.2 is the safer pick when the task is a hard from-scratch build and the result needs to arrive as a complete, runnable project. Its greenfield edge was the clearest difference in the scored set. It cost more and took longer, but it produced more full-pass runs where the task started from nothing.
MiniMax M3 is the value pick for a lot of worker traffic. It was much cheaper, faster, and effectively tied on existing-code tasks. If the work is a bug fix, feature addition, or repair-to-green loop under review, MiniMax looks strong enough to be the default worker.
I wouldn't make either one the top-level coordinator by default. The best shape is still a frontier coordinator or judge above them: GPT-5.5 or Claude Opus deciding what to delegate, checking the finished work, and rerunning narrow pieces when the answer looks wrong. These models make the worker layer much more serious, not the coordinator layer unnecessary.
Sources: Thinkbench evaluation harness, downloadable result bundle, and downloadable evaluation suite. Runner configuration: MiniMax M3 and GLM 5.2 on Fireworks AI serverless endpoints, priority tier, thinking disabled, three trials per task.