Bringing DeepSeek Sparse Attention to Multimodal Models: Kuaishou Keye 2.0 Opens a New Paradigm of Enhanced Reasoning

Kuaishou has released Keye-VL-2.0-30B-A3B, a multimodal large language model that it describes as the first to apply DeepSeek Sparse Attention (DSA) to multimodal scenarios, supporting contexts of up to 256K tokens. The model reports SOTA results on long-video temporal understanding benchmarks and adds built-in Agent capabilities, targeting enhanced reasoning and real-world business applications.

Key points

  • First multimodal model to integrate DeepSeek Sparse Attention (DSA), targeting the bottlenecks of long-video understanding.
  • Reports SOTA results on TimeLens, LongVideoBench, and MLVU; avoids long-context decay, with VideoMME V2 accuracy rising from 35.34% to 42.44% (+7.10 points) as input grows from 64 to 512 frames.
  • Debuts Agent mechanism with Code, Tool, and Search capabilities, demonstrating robust multi-step task planning.
  • Employs MOPD and Context-RL to overcome catastrophic forgetting and enhance reasoning reliability.

Why it matters

Keye-VL-2.0 is reported to be the first multimodal model to adopt DSA, which addresses the cost and accuracy bottlenecks that have limited long-video understanding.

Technical impact

The release may affect model selection, inference cost, product capability, and evaluation benchmarks.

Kuaishou has officially launched Keye-VL-2.0-30B-A3B, the latest generation of its multimodal large language model. According to the company, it is the first model to bring DeepSeek Sparse Attention (DSA) to multimodal understanding, supporting ultra-long video contexts of up to 256K tokens. DSA, combined with purpose-built feature aggregation, lets the model extract key information from hours-long video while cutting computational overhead: each query attends to a selected subset of tokens rather than the full context. Prefill cost is reportedly reduced by 50%, and decode cost grows far more slowly with context length than under traditional full attention, whose cost grows quadratically with sequence length.
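To make the cost argument concrete, here is a minimal sketch of generic top-k sparse attention for a single decode step, written in plain Python. It is not Keye's or DeepSeek's implementation: the function names, the caller-supplied `index_scores` (a stand-in for whatever cheap relevance scorer selects tokens), and the budget `k` are all assumptions made for illustration.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]

def attend(q, keys, values):
    """Dense attention for one query vector over the given keys/values."""
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]

def topk_sparse_attend(q, keys, values, index_scores, k):
    """Attend only to the k cached tokens with the highest index score.

    index_scores is hypothetical: one cheap relevance score per cached token.
    """
    k = min(k, len(keys))
    order = sorted(range(len(keys)), key=lambda i: index_scores[i], reverse=True)
    keep = sorted(order[:k])  # restore positional order of the selected tokens
    out = attend(q, [keys[i] for i in keep], [values[i] for i in keep])
    return out, keep

def causal_pairs(length, k=None):
    """Query-key pairs scored during causal prefill.

    Full attention scores length*(length+1)/2 pairs (quadratic); with a
    per-query budget of k tokens the count grows linearly once length > k.
    The cost of the token selector itself is not counted here.
    """
    if k is None or k >= length:
        return length * (length + 1) // 2
    return k * (k + 1) // 2 + (length - k) * k

# Toy cache of four tokens; only the two highest-scoring ones are attended to.
q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [-1.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
out, keep = topk_sparse_attend(q, keys, values, [0.1, 0.5, 0.9, 0.2], k=2)
```

With the budget set to the full cache size the sparse path reduces to dense attention, so the quality of the result depends entirely on how well the selector ranks tokens. A real selector also has to score every cached token, which `causal_pairs` deliberately leaves out.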

The model posts strong results on temporal understanding benchmarks. On the TimeLens benchmark, it achieved an mIoU of 58.4 on Charades, approaching the closed-source Gemini 3 Flash (61.2), and on ActivityNet it surpassed both Gemini 2.5 Pro (58.1) and Gemini 3 Flash (57.0). On QVHighlights, it reached 70.1 mIoU, far exceeding Gemini 3 Flash's 49.5. More importantly, the model avoided the common 'long-context decay' phenomenon, in which accuracy drops as more input is added: when input frames were expanded from 64 to 512, average accuracy on VideoMME V2 rose from 35.34% to 42.44% (+7.10 points), and the non-linear score increased from 18.54 to 24.19 (+5.65). In other words, the additional frames improved results rather than degrading them.
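For readers unfamiliar with the metric, mIoU in temporal grounding is conventionally the mean intersection-over-union between predicted and ground-truth time spans, scaled to 0-100. The sketch below shows that standard single-segment computation; the exact TimeLens evaluation protocol is not described in the article and may differ, and the sample intervals are made up for illustration.

```python
def temporal_iou(pred, gt):
    """IoU of two (start, end) intervals given in seconds."""
    ps, pe = pred
    gs, ge = gt
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0

def mean_iou(preds, gts):
    """Mean temporal IoU over paired predictions, on a 0-100 scale."""
    return 100.0 * sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(preds)

# Illustrative values only: one half-overlapping span and one exact match.
preds = [(2.0, 8.0), (0.0, 5.0)]
gts = [(4.0, 10.0), (0.0, 5.0)]
score = mean_iou(preds, gts)
```

Under this definition a score of 58.4 means the predicted spans overlap the annotated ones by a little under 60% on average, so a difference of a few points between models reflects a real change in localization precision.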

Real-world examples further illustrate the model's abilities. Given a 9-minute Iceland travel vlog, it not only picked up visual details such as 'cold hands' but also drew causal inferences: it recommended gloves because of the vlogger's freezing hands, offered culturally sensitive advice about local food, and suggested guided tours over self-driving after noticing a car accident in the snow. In a pottery-making video, it produced a detailed, timestamped breakdown of more than a dozen craft steps. For a game highlight video, it identified the 'clutch' moment by analyzing visual intensity, audio-visual synergy, and narrative context, and compared it with earlier segments to justify its choice. These capabilities go beyond simple scene tagging and reflect temporal and causal reasoning over the whole video.

The model also ships with a built-in Agent framework that unifies perception, planning, and execution. The Code Agent demonstrates strong logical reasoning, scoring 77.10 on LiveCodeBench v6 and 62.00 on SWE-bench Verified. The Tool Agent handles multi-step API calls, achieving 82.58 on TAU2-Bench and showing robust fault tolerance in complex workflows. For example, when asked to handle a multi-part request involving store lookups, distance calculations, and order creation, the model autonomously planned and executed more than a dozen API calls, correcting its own errors along the way.
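The store-lookup example follows a common agent pattern: execute a plan of tool calls in order, and when a call fails, feed the error back so the arguments can be corrected and the call retried. The sketch below illustrates that pattern only. Every tool name, the `$name` placeholder convention, and the hard-coded `repair` function (standing in for the model re-reading an error message) are hypothetical and are not taken from Keye-VL.

```python
from typing import Any, Dict

# Hypothetical tools; a real deployment would wrap actual service APIs.
STORES = {"Beijing": {"id": "s1", "km": 2.5}}

def find_store(city: str) -> Dict[str, Any]:
    if city not in STORES:
        raise ValueError(f"unknown city: {city}")
    return STORES[city]

def distance_km(store: Dict[str, Any]) -> float:
    return store["km"]

def create_order(store: Dict[str, Any], item: str) -> str:
    return f"order:{store['id']}:{item}"

TOOLS = {"find_store": find_store, "distance_km": distance_km,
         "create_order": create_order}

def resolve(args, results):
    """Replace "$name" placeholders with results saved by earlier steps."""
    return {k: results[v[1:]] if isinstance(v, str) and v.startswith("$") else v
            for k, v in args.items()}

def run_plan(plan, tools, repair, max_retries=2):
    """Run tool calls in order; on failure, ask `repair` for new args and retry."""
    results, log = {}, []
    for step in plan:
        args = dict(step["args"])
        for attempt in range(max_retries + 1):
            try:
                results[step["save_as"]] = tools[step["tool"]](**resolve(args, results))
                log.append((step["tool"], "ok"))
                break
            except Exception as err:
                log.append((step["tool"], "error"))
                if attempt == max_retries:
                    raise  # give up after the retry budget is spent
                args = repair(step, args, err)
    return results, log

def repair(step, args, err):
    # Stand-in for the model's self-correction; it only fixes one known typo.
    if step["tool"] == "find_store" and args.get("city") == "Bejing":
        return {**args, "city": "Beijing"}
    return args

PLAN = [
    {"tool": "find_store", "args": {"city": "Bejing"}, "save_as": "store"},
    {"tool": "distance_km", "args": {"store": "$store"}, "save_as": "km"},
    {"tool": "create_order", "args": {"store": "$store", "item": "latte"},
     "save_as": "order"},
]

results, log = run_plan(PLAN, TOOLS, repair)
```

Here the first call fails on the misspelled city, the repair step fixes it, and the remaining calls reuse the stored result. Bounding the retries keeps a step that cannot be repaired from looping indefinitely.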

To enable seamless multi-task learning without catastrophic forgetting, Kuaishou introduced cross-modal MOPD (Multi-Expert Policy Distillation/Merging). This technique uses dynamic routing and parameter fusion to integrate expert models, along with a novel ‘Bucket Advantage Scaling’ method that amplifies core reasoning signals while suppressing template noise. Additionally, a Context-RL reward mechanism provides dense, fine-grained supervision to reduce hallucination in multi-step reasoning, especially in math, medical, and coding domains. A strict data engine with accuracy filtering ensures high-quality training trajectories.

The impact of Keye-VL-2.0-30B-A3B extends beyond benchmarks. Kuaishou is deploying the model across its core business, including generative recommendations, content moderation, and targeted advertising, where it reports improved distribution accuracy and commercial returns. By combining video understanding with Agent capabilities (Video × Agent), the platform aims to automate content production tasks such as highlight extraction, editing, and marketing copy generation. The model has 30B total parameters (the A3B suffix conventionally denotes about 3B activated per token) and is released as open source on Hugging Face and GitHub, positioning Keye-VL as an example of research-grade multimodal understanding translating directly into business value.