Alibaba's Qwen3.7-Max Ranks Second Globally in Coding Benchmark, Trailing Only Claude

Alibaba's latest flagship model, Qwen3.7-Max, scored 1541 on the Code Arena leaderboard, surpassing GPT-5.5 and other models to rank second globally behind the Claude series.

Key points

  • Qwen3.7-Max scored 1541 on Code Arena, ranking second only to Claude.
  • Code Arena is a blind-test platform where developers submit full web app challenges.
  • The model excels at long-horizon tasks and can run for up to 35 hours with over 1,000 tool calls.

Why it matters

Qwen3.7-Max is the first Chinese model to pass 1540 points on Code Arena, and it now trails only the Claude series on a leaderboard that Claude models have long dominated.

Technical impact

The result may affect model selection, inference cost, product capability, and evaluation benchmarks.

Alibaba's latest flagship AI model, Qwen3.7-Max, has secured the second position globally on the Code Arena leaderboard, a prestigious third-party coding benchmark. Released on May 26, 2026, the model scored 1541, outperforming major competitors such as GPT-5.5, Gemini-3.5-Flash, GLM-5.1, and Kimi-K2.6. It now trails only the Claude series from Anthropic.

Code Arena, hosted by the blind-test platform LMArena, is considered one of the most credible evaluations of AI coding ability. Unlike traditional benchmarks that focus on isolated code snippets or algorithm problems, Code Arena requires models to generate complete, interactive web applications from scratch based on developer-created challenges. The outputs are anonymized and pitted against each other, and user votes on those comparisons determine the ranking. Qwen3.7-Max is the first Chinese model to break the 1540-point barrier and the only Chinese model in the top four, cutting into the long-standing dominance of Claude models.
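The article does not describe how Code Arena converts votes into a score such as 1541. As a rough illustration of how blind head-to-head votes can produce a numeric ranking, the sketch below uses a simple Elo-style update. This is an assumption made for illustration, not LMArena's documented method; the function names, the K-factor of 32, the starting rating of 1000, and the model names are all hypothetical.

```python
from typing import Dict, Iterable, Tuple


def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def update_ratings(
    ratings: Dict[str, float],
    votes: Iterable[Tuple[str, str]],
    k: float = 32.0,        # assumed K-factor, not from the article
    initial: float = 1000.0  # assumed starting rating, not from the article
) -> Dict[str, float]:
    """Apply (winner, loser) votes in order and return the new ratings."""
    ratings = dict(ratings)  # copy so the caller's dict is not mutated
    for winner, loser in votes:
        r_w = ratings.get(winner, initial)
        r_l = ratings.get(loser, initial)
        # The winner gains more when the win was less expected.
        delta = k * (1.0 - expected_score(r_w, r_l))
        ratings[winner] = r_w + delta
        ratings[loser] = r_l - delta
    return ratings


if __name__ == "__main__":
    # Hypothetical anonymous matchups: each tuple is (winner, loser).
    votes = [("model_a", "model_b"), ("model_a", "model_c"), ("model_b", "model_c")]
    for name, rating in sorted(update_ratings({}, votes).items(), key=lambda x: -x[1]):
        print(f"{name}: {rating:.1f}")
```

Under this scheme a single vote between two equally rated models moves each by half the K-factor, so ratings only separate meaningfully after many votes. Production leaderboards typically fit all votes jointly rather than updating one vote at a time, which makes the result independent of vote order.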

Designed specifically for agentic tasks, Qwen3.7-Max demonstrates significant breakthroughs in coding, agent reasoning, and long-horizon task execution. It can independently complete complex end-to-end projects in just a few hours that would typically take a professional team two weeks. It can also sustain continuous operation for up to 35 hours, executing over 1,000 tool calls, and can even optimize chip kernels on its own through iterative programming.

The model has received widespread acclaim from developers and AI creators. Early adopters praised its "impressive long-horizon autonomous execution ability" and described it as "a true agent foundation model that gets things done." Independent evaluations by AI labs comparing Qwen3.7-Max with Claude-4.7 and GPT-5.5 under identical prompts found that the Alibaba model offered the largest performance improvement over its predecessor, the lowest inference cost, and clear advantages in both output speed and generation quality.

This milestone underscores Alibaba's rapid progress in AI code generation and positions Qwen3.7-Max as a formidable contender in the global AI landscape.