Claude's Pass Rate Below 4%: SaaS-Bench Shatters the 'Fully Automated Office' Illusion of Computer-Use
UniPat AI releases SaaS-Bench, a benchmark evaluating mainstream large models on real office tasks. The highest full pass rate is only 3.8%, revealing that AI-powered fully automated offices are far from reality.
Article intelligence
Key points
- SaaS-Bench evaluation shows the best model, Claude Opus 4.7, achieves a full pass rate of only 3.8%.
- 93.4% of tasks span at least two applications, and 97.3% of text-based tasks involve more than 100 steps.
- Four structural failure modes: declining accuracy with longer tasks, cascading errors from single mistakes, lack of verification loops, and highly inconsistent execution.
- The current agent paradigm has fundamental limitations in long-horizon tasks, and software may need to be redesigned for agents.
Why it matters
This matters because SaaS-Bench shows that even the best model, Claude Opus 4.7, fully completes only 3.8% of realistic office tasks, a wide gap from the promise of fully automated office work.
Technical impact
The results may affect model selection, inference cost, product capability, and evaluation benchmarks.
UniPat AI has unveiled SaaS-Bench, a rigorous benchmark designed to evaluate the real-world performance of AI agents on office tasks. The results are sobering: even the most capable model, Claude Opus 4.7, managed a full pass rate of just 3.8% across 106 tasks. Other models like Kimi K2.5 and Gemini 3.1 Pro scored zero percent. The benchmark exposes a vast gap between the hype surrounding "computer-use" agents and their actual ability to complete complex, multi-step workflows.
SaaS-Bench simulates authentic office environments using 23 open-source SaaS applications deployed via Docker, complete with real front-end and back-end logic, database states, and business constraints. The 106 tasks cover six domains including software development, finance, healthcare, and team collaboration. A staggering 93.4% of tasks require actions across at least two applications, and 97.3% of text-based tasks involve more than 100 steps, with some trajectories exceeding 300 steps. This makes SaaS-Bench a far cry from simpler benchmarks that use simulated environments and short tasks.
The benchmark's strict evaluation uses two metrics: the Checkpoint Score (partial credit for completing sub-steps) and the Resolved Score (full pass only if all checkpoints are met). Claude Opus 4.7 achieved a Checkpoint Score of 43.9% but a Resolved Score of just 3.8%, meaning it completed only 4 out of 106 tasks end-to-end. The discrepancy highlights that agents can make progress but consistently fail to finish entire workflows.
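The gap between the two metrics is easier to see with a small sketch. The code below is only an illustration under stated assumptions: the article does not say how SaaS-Bench aggregates checkpoints across tasks, so the averaging of per-task fractions, the function names, and the sample data are all hypothetical.

```python
from typing import List


def checkpoint_score(tasks: List[List[bool]]) -> float:
    """Mean fraction of checkpoints passed per task.

    Assumption: per-task fractions are averaged; the benchmark's real
    aggregation rule is not described in the article.
    """
    return sum(sum(t) / len(t) for t in tasks) / len(tasks)


def resolved_score(tasks: List[List[bool]]) -> float:
    """Fraction of tasks in which every checkpoint passed."""
    return sum(all(t) for t in tasks) / len(tasks)


# Hypothetical results for four tasks with four checkpoints each.
example = [
    [True, True, True, True],      # fully resolved
    [True, True, False, False],    # half of the checkpoints
    [True, False, False, False],   # a quarter of the checkpoints
    [False, False, False, False],  # no progress
]

print(f"Checkpoint Score: {checkpoint_score(example):.1%}")
print(f"Resolved Score:   {resolved_score(example):.1%}")

# The reported headline figure: 4 of 106 tasks resolved end-to-end.
print(f"4 / 106 = {4 / 106:.1%}")
```

In this made-up sample the agent earns substantial partial credit while resolving only one task in four, which is the same pattern, in miniature, as the reported 43.9% versus 3.8%.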
SaaS-Bench identifies four structural failure modes. First, accuracy degrades as tasks lengthen: even with a 95% per-checkpoint success rate, the probability of passing all 12 checkpoints in a task drops to about 54% (0.95^12 ≈ 0.54, if checkpoints are treated as independent). Second, a single early error cascades into downstream failures; for example, a 3% weight error when creating a client led to a 30% loss later. Third, agents often fail to verify their work, believing they succeeded when the system state shows otherwise. Fourth, execution is highly inconsistent: Claude Sonnet 4.6 scored anywhere from 0.00 to 0.68 on the same task across three runs due to path dependence.
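The compounding effect behind the first failure mode can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes every checkpoint succeeds independently with the same probability, which is a simplification; the function name and the extra rates in the loop are illustrative and not taken from the benchmark.

```python
def full_pass_probability(per_checkpoint_rate: float, checkpoints: int) -> float:
    """Probability of passing every checkpoint in a task.

    Assumption: checkpoints are independent and share one success rate,
    so the probabilities multiply.
    """
    return per_checkpoint_rate ** checkpoints


# The article's example: a 95% per-checkpoint rate over 12 checkpoints.
print(f"{full_pass_probability(0.95, 12):.1%}")

# Illustrative only: how the full-pass probability changes with the rate.
for rate in (0.90, 0.95, 0.99):
    print(rate, round(full_pass_probability(rate, 12), 3))
```

Because the probabilities multiply, small per-checkpoint error rates dominate the outcome once a task has many checkpoints, even when each individual step looks reliable.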
These findings suggest that current agents lack persistent state reasoning, closed-loop verification, and error recovery capabilities. The authors argue that these are not just engineering issues but fundamental limitations of the current agent paradigm. Furthermore, they predict that software designed for human users may need to be redesigned for AI agents, as today's interfaces are optimized for human eyes and hands rather than automated processes. The full blog, code, and paper are available on UniPat AI's GitHub and arXiv.