2026-05-26 05:32 UTC · Updated: 2026-06-30 13:03 UTC

Alibaba's Qwen3.7-Max Ranks Second Globally in Coding Benchmark, Trailing Only Claude

Alibaba's latest flagship model Qwen3.7-Max achieved a score of 1541 on the authoritative Code Arena leaderboard, surpassing GPT-5.5 and other models, ranking second globally behind the Claude series.

Source: 量子位 (QbitAI) · Author: 量子位的朋友们

Alibaba's latest flagship AI model, Qwen3.7-Max, has secured the second position globally on the Code Arena leaderboard, a prestigious third-party coding benchmark. Released on May 26, 2026, the model scored 1541, outperforming major competitors such as GPT-5.5, Gemini-3.5-Flash, GLM-5.1, and Kimi-K2.6. It now trails only the Claude series from Anthropic.

Code Arena, hosted by the blind-test platform LMArena, is considered one of the most credible evaluations of AI coding ability. Unlike traditional benchmarks that focus on isolated code snippets or algorithm problems, Code Arena requires models to generate complete, interactive web applications from scratch based on developer-created challenges. The outputs are anonymized and compared head-to-head, and user votes on those comparisons determine the overall ranking. Qwen3.7-Max is the first Chinese model to break the 1540-point barrier and the only Chinese model in the top four, breaking the long-standing dominance of Claude models at the top of the leaderboard.
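The article does not say how Code Arena turns user votes into a score such as 1541. As a rough illustration only, the sketch below uses a plain Elo update, one common way to rank competitors from pairwise votes; it is not presented as LMArena's actual method. The model names, the starting rating of 1000, and the K-factor of 32 are all hypothetical values chosen for the example.

```python
from collections import defaultdict


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_ratings(votes, k: float = 32.0, initial: float = 1000.0) -> dict:
    """Apply one Elo update per (winner, loser) vote, in order.

    `k` and `initial` are illustrative assumptions, not values from the article.
    """
    ratings = defaultdict(lambda: initial)
    for winner, loser in votes:
        # The less expected the win, the more points change hands.
        delta = k * (1.0 - expected_score(ratings[winner], ratings[loser]))
        ratings[winner] += delta
        ratings[loser] -= delta
    return dict(ratings)


# Hypothetical blind votes: each tuple is (preferred output, other output).
votes = [
    ("model_a", "model_b"),
    ("model_a", "model_c"),
    ("model_b", "model_c"),
]

ratings = update_ratings(votes)
leaderboard = sorted(ratings, key=ratings.get, reverse=True)
for name in leaderboard:
    print(f"{name}: {ratings[name]:.1f}")
```

Because every vote moves the same number of points from the loser to the winner, the total across all models stays constant, and a model's score reflects how often, and against whom, its output was preferred.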

Designed specifically for agentic tasks, Qwen3.7-Max demonstrates significant breakthroughs in coding, agent reasoning, and long-horizon task execution. It can independently complete, in just a few hours, complex end-to-end projects that would typically take a professional team two weeks. It can also operate continuously for up to 35 hours, execute over 1,000 tool calls, and even optimize chip kernels on its own through iterative programming.

The model has received widespread acclaim from developers and AI creators. Early adopters praised its "impressive long-horizon autonomous execution ability" and described it as "a true agent foundation model that gets things done." Independent evaluations by AI labs comparing Qwen3.7-Max with Claude-4.7 and GPT-5.5 under identical prompts found that the Alibaba model offered the largest performance improvement over its predecessor, the lowest inference cost, and clear advantages in both output speed and generation quality.

This milestone underscores Alibaba's rapid progress in AI code generation and positions Qwen3.7-Max as a formidable contender in the global AI landscape.