Researchers from Penn State University and Duke University, in collaboration with Google DeepMind and others, introduce the problem of automated failure attribution in LLM multi-agent systems. They present the Who&When benchmark dataset and evaluate three attribution methods: All-at-Once, Step-by-Step, and Binary Search. Their work, accepted as a Spotlight at ICML 2025, aims to help developers quickly pinpoint which agent caused a failure and at what step. Current methods achieve at most 53.5% accuracy in identifying the responsible agent and 14.2% in locating the error step.
First formalization of automated failure attribution for LLM multi-agent systems.
The Who&When dataset includes 127 failure logs with fine-grained annotations of the responsible agent and error step.
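As a rough illustration of the Binary Search idea named above, the following sketch halves a failure log repeatedly until a single step remains. It is not the authors' implementation: the `judge` callable stands in for an LLM call and its signature is hypothetical, as is the representation of a log as a list of (agent, message) pairs.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, str]  # (agent name, message); an assumed log format

# Hypothetical judge: given the log and a window [lo, hi) split at mid,
# return True if the decisive error lies in the first half [lo, mid).
Judge = Callable[[List[Step], int, int, int], bool]


def binary_search_attribution(log: List[Step], judge: Judge) -> Tuple[str, int]:
    """Narrow a failure log down to one suspected (agent, step index)."""
    if not log:
        raise ValueError("log must contain at least one step")
    lo, hi = 0, len(log)  # current search window is [lo, hi)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if judge(log, lo, mid, hi):
            hi = mid  # keep the first half
        else:
            lo = mid  # keep the second half
    return log[lo][0], lo
```

The appeal of this scheme is that it needs about log2(n) judge calls for a log of n steps, but each call must be answered correctly, since one wrong half discards the true error step for good.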
MIT introduces SEAL, a framework enabling large language models to self-edit and update their weights via reinforcement learning, marking significant progress toward self-evolving AI.
SEAL allows LLMs to generate self-edits via reinforcement learning and update their weights.
Demonstrates substantial performance gains in few-shot learning and knowledge integration.
Automated failure attribution is formalized as a new research task for LLM multi-agent systems. The Who&When benchmark dataset with fine-grained annotations is introduced. Three attribution methods are evaluated, achieving at most 53.5% accuracy in identifying the responsible agent and 14.2% for the exact error step, highlighting the difficulty. The paper is accepted as a Spotlight at ICML 2025.
First formalization of automated failure attribution in multi-agent systems.
Who&When dataset of 127 system failure logs with detailed human annotations.
Researchers from Adobe Research, Stanford University, and Princeton University propose a novel architecture combining State-Space Models (SSMs) and dense local attention to overcome the long-standing challenge of long-term memory in video generation. Using block-wise SSM scanning, diffusion forcing, and frame local attention, the model achieves superior performance on Memory Maze and Minecraft datasets while maintaining computational efficiency, enabling interactive applications.
Proposes the Long-Context State-Space Video World Model (LSSVWM), which combines SSMs for long-range memory with local attention for spatial coherence.
Introduces a block-wise SSM scanning scheme to extend temporal memory while balancing computational cost.
A new 14-page technical paper from the DeepSeek-V3 team, co-authored by CEO Wenfeng Liang, explores hardware-aware model co-design to overcome scaling challenges. It details innovations such as Multi-head Latent Attention (MLA), DeepSeekMoE, FP8 training, and node-aware routing to achieve cost-efficient large-scale training and inference.
DeepSeek-V3's technical paper reveals hardware-aware co-design strategies for low-cost LLM training.
Key innovations include MLA for memory efficiency, DeepSeekMoE for sparse computation, and FP8 mixed-precision training.
DeepSeek AI releases DeepSeek-Prover-V2, an open-source LLM for Lean 4 theorem proving. It uses a recursive proof search pipeline powered by DeepSeek-V3 to generate cold-start training data, followed by reinforcement learning, and achieves top results on MiniF2F.
DeepSeek-Prover-V2 uses a recursive proof search pipeline with DeepSeek-V3 to generate cold-start training data.
Achieves 88.9% pass ratio on MiniF2F-test and solves 49 problems from PutnamBench.
Kwai AI's SRPO framework slashes LLM RL post-training steps by 90% while matching DeepSeek-R1 performance in math and code. This two-stage RL approach with history resampling overcomes GRPO limitations.
SRPO addresses cross-domain optimization conflicts between math and code via two-stage training.
History resampling improves gradient signal quality and prevents training stagnation.
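To make the history-resampling point concrete, here is a minimal sketch. It assumes binary rewards (1 for a correct rollout, 0 otherwise) and assumes the filter drops prompts whose recorded rollouts were all correct, since group-normalized advantages for such prompts are all zero and carry no gradient signal. The function names and data layout are illustrative, not taken from the SRPO code.

```python
from statistics import mean, pstdev
from typing import Dict, List


def group_advantages(rewards: List[float]) -> List[float]:
    """Group-normalized advantages, in the style of GRPO."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        # All rollouts scored the same: no learning signal for this prompt.
        return [0.0] * len(rewards)
    return [(r - mu) / sigma for r in rewards]


def history_resample(history: Dict[str, List[int]]) -> List[str]:
    """Keep prompts whose recorded rollouts were not all correct.

    Assumption: prompts with mixed or all-incorrect outcomes are retained;
    only the "too easy" all-correct prompts are dropped.
    """
    return [
        prompt
        for prompt, rewards in history.items()
        if not all(r == 1 for r in rewards)
    ]
```

For example, a prompt with rewards `[1, 1, 1, 1]` yields advantages of all zeros and would be filtered out before the next epoch, while one with `[1, 0, 1, 0]` is kept.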
Chinese AI company Zhipu.AI open-sources its next-generation GLM model series, including the GLM-Z1 inference model with speeds up to 8x faster than DeepSeek-R1, the GLM-Z1-Rumination reasoning model, and agent-enhanced GLM-4 models. It also launches the Z.ai international platform and offers enterprise MaaS services. This move showcases technical prowess and global ambitions, potentially paving the way for an IPO.
Open-sources the GLM-Z1 inference model, which achieves 200 tokens/s on consumer GPUs, 8x faster than DeepSeek-R1.
Launches the Rumination model for autonomous AI agents with internet search, analysis, and self-verification.
DeepSeek AI has published a research paper detailing a new technique to enhance the scalability of general reward models during inference, while hinting at the imminent arrival of its next-generation model, R2.
DeepSeek introduces Self-Principled Critique Tuning (SPCT) to improve inference-time scaling of general reward models.
SPCT uses rejection fine-tuning and rule-based online RL to dynamically generate principles and critiques.
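The rejection step can be pictured as a simple filter over sampled judgments. The sketch below is an assumption-laden illustration, not DeepSeek's implementation: it supposes each sample carries generated principles, a critique, and one pointwise score per candidate response, and it keeps a sample only when its single top-scored response matches a known preferred response. The `Sample` class and both function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    principles: str    # principles the reward model generated
    critique: str      # critique written under those principles
    scores: List[int]  # one pointwise score per candidate response


def is_correct(sample: Sample, best_index: int) -> bool:
    """True if the sample ranks the labeled best response strictly first."""
    top = max(sample.scores)
    # A tie for the top score is treated as a wrong judgment.
    return sample.scores.count(top) == 1 and sample.scores.index(top) == best_index


def rejection_filter(samples: List[Sample], best_index: int) -> List[Sample]:
    """Keep only sampled judgments that agree with the ground-truth label."""
    return [s for s in samples if is_correct(s, best_index)]
```

The surviving samples would then serve as fine-tuning data, so the model learns to produce principles and critiques that lead to correct judgments.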