A classical intuition holds that verifying a solution is easier than producing one. For today's coding agents, this intuition is inverted: generating complex candidate solutions is no longer difficult, but reliably verifying them has become the harder problem. Every verifier is a proxy for human intent, never the intent itself, leading to a twofold difficulty: underspecified intent and optimization widening the proxy–intent gap. The paper characterizes verification signals along scalability, faithfulness, and robustness, and studies four reward constructions. Experiments show targeted verification design can suppress reward hacking and improve task completion quality, with the core observation that no fixed reward function remains effective as policy capability grows; verification must co-evolve with the generator.
Verification is now harder than generation for coding agents; all verifiers are proxies for human intent.
Two fundamental challenges: inherently underspecified intent and optimization-induced proxy–intent divergence.
COrigami is an end-to-end AI-driven pipeline that generates crease patterns from natural language, satisfying strict flat-foldability constraints and visual aesthetics. It assists human artists by generating structural starting points through steps including semantic stick figure generation, base packing, crease pattern solving, shaping, and reinforcement learning with an autonomous aesthetic evaluation loop.
COrigami converts natural language into crease patterns that satisfy flat-foldability constraints.
The pipeline includes semantic stick figure generation, base packing, crease pattern solving, shaping, and reinforcement learning.
This paper proposes a governance model for autonomous AI agents that does not monitor their reasoning but requires independently attested evidence at the point of high-risk actions. The agent retains autonomy over planning and reasoning, but execution of designated high-risk actions is conditional on preconditions attested by separate authoritative sources, cryptographically bound to a declared intent, and evaluated by a deterministic policy. Decisions are recorded in a tamper-evident log. A proof-of-concept implementation is presented with examples from software deployment and clinical prescribing.
Autonomous AI agents may perform consequential, irreversible actions like clinical prescribing or software deployment.
Proposed model: agents retain autonomy but have no execution authority over high-risk actions; execution requires independently attested preconditions.
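The gating idea above can be illustrated with a minimal sketch. Everything named here (the `SOURCE_KEYS` table, the `REQUIRED` policy, the HMAC-based attestation) is a hypothetical stand-in chosen for brevity, not the paper's proof-of-concept; a real deployment would likely use asymmetric signatures and managed keys.

```python
import hashlib
import hmac
import json

# Hypothetical shared keys for two attesting sources (illustration only).
SOURCE_KEYS = {"ci": b"ci-secret", "change-board": b"board-secret"}

# Hypothetical deterministic policy: which (source, claim) pairs an action needs.
REQUIRED = {"deploy": {("ci", "tests_passed"), ("change-board", "approved")}}


def _message(intent, claim):
    # Binding the claim to the declared intent: both go into the signed bytes.
    return json.dumps({"intent": intent, "claim": claim}, sort_keys=True).encode()


def attest(source, intent, claim):
    """An authoritative source signs a claim bound to the declared intent."""
    tag = hmac.new(SOURCE_KEYS[source], _message(intent, claim), hashlib.sha256)
    return {"source": source, "claim": claim, "tag": tag.hexdigest()}


def verify(att, intent):
    key = SOURCE_KEYS.get(att["source"])
    if key is None:
        return False
    expected = hmac.new(key, _message(intent, att["claim"]), hashlib.sha256)
    return hmac.compare_digest(expected.hexdigest(), att["tag"])


def decide(intent, attestations):
    """Allow the action only if every required precondition is validly attested."""
    required = REQUIRED.get(intent["action"])
    if required is None:
        return False  # unknown action: deny by default
    valid = {(a["source"], a["claim"]) for a in attestations if verify(a, intent)}
    return required <= valid


def append_log(log, entry):
    """Append a decision record whose hash chains to the previous record."""
    prev = log[-1]["hash"] if log else "0" * 64
    body = json.dumps(entry, sort_keys=True)
    digest = hashlib.sha256((prev + body).encode()).hexdigest()
    log.append({"prev": prev, "entry": entry, "hash": digest})


def log_is_intact(log):
    """Recompute the chain; any edited record breaks it."""
    prev = "0" * 64
    for rec in log:
        body = json.dumps(rec["entry"], sort_keys=True)
        digest = hashlib.sha256((prev + body).encode()).hexdigest()
        if rec["prev"] != prev or rec["hash"] != digest:
            return False
        prev = rec["hash"]
    return True
```

Because each attestation covers the intent as well as the claim, evidence gathered for one deployment cannot be replayed for a different one, and the agent's reasoning never needs to be inspected.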
Researchers propose DD-Elo, a skill assessment framework using a drift diffusion model and move-level data, achieving faster adaptation than Elo while maintaining bounded deviation.
DD-Elo integrates move-level data to capture rapid skill fluctuations.
Rigorous mathematical proof shows bounded deviation from Elo.
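For reference, the classical Elo update that DD-Elo is compared against can be sketched as follows. This is the standard game-level baseline only; the drift diffusion component and the move-level signal are not reproduced here, and the K-factor of 32 is a common convention rather than a value from the paper.

```python
def expected_score(r_a, r_b):
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_update(r_a, r_b, score_a, k=32.0):
    """Return updated ratings after one game.

    score_a is 1.0 for a win by A, 0.5 for a draw, 0.0 for a loss.
    """
    delta = k * (score_a - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta
```

Since the rating moves only once per game and by at most `k` points, classical Elo adapts slowly to rapid skill changes, which is the gap that move-level data is meant to address.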
This study develops a provenance-aware, knowledge-graph-based multi-agent framework that integrates Reddit posts, WebMD reviews, and FDA adverse event records for nine antidepressants, achieving high entity recognition accuracy and revealing that patient-generated data provide partly independent safety signals, with community sources often preceding regulatory reports.
Framework unifies 466,525 Reddit posts, 60,782 WebMD reviews, and 20 years of FDA data for nine antidepressants.
LLM entity recognition pipeline achieves F1 scores of 0.969 for medications and 0.973 for conditions.
This paper introduces an LLM-powered comparative pipeline for large-scale governance discourse analysis of AI agent protocols. It validates the pipeline on two contrasting standards: ERC-8004 (permissionless, on-chain) and Google A2A (corporate-led). Analyzing 4,323 governance participation records, it finds that while governance form influences substantive focus, both regimes exhibit comparable participation inequality and community fragmentation. However, discourse alignment is denser in the permissionless setting, suggesting open governance may foster greater thematic convergence.
Introduces an LLM-powered pipeline combining automated annotation, neural topic modeling, and multi-layer network analysis.
Compares ERC-8004 (permissionless) and Google A2A (corporate-led) governance structures.
AlgoEvolve is an LLM-driven evolutionary framework that generates, evaluates, and iteratively improves executable trading strategies. Across multiple experiments, the system exhibits emergent regime-adaptive strategy logic and introduces a meta-evolutionary outer loop to evolve prompts, improving search heuristics. The results demonstrate that LLM-based semantic evolution provides a viable approach for continual program synthesis in complex environments.
AlgoEvolve uses LLMs as semantic mutation operators for algorithmic trading.
The system exhibits emergent regime-adaptive trading logic.
This paper shows that refusal in instruction-tuned chat models is gated by a compliant persona direction. Interventions on Qwen2.5-7B-Instruct and Llama-3.1-8B-Instruct demonstrate that steering along the compliant persona direction suppresses refusal (e.g., Llama's refusal rate drops from 97% to 2%), and that the refusal direction only partially restores refusal, and only in late layers. The findings indicate that refusal is expressed downstream of persona computation.
Compliant persona steering reduces refusal rates drastically (97% to 2% in Llama).
Refusal direction partially restores refusal only in late layers, not early ones.
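Direction-based interventions of this kind are commonly implemented as simple vector operations on a hidden state. The sketch below shows the two generic operations, adding a scaled direction and projecting a direction out; it is an assumption that the paper's interventions resemble these, and the function names are hypothetical.

```python
import math


def unit(v):
    """Normalize a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def steer(hidden, direction, alpha):
    """Push a hidden state along a direction by strength alpha."""
    d = unit(direction)
    return [h + alpha * di for h, di in zip(hidden, d)]


def ablate(hidden, direction):
    """Remove the component of a hidden state that lies along a direction."""
    d = unit(direction)
    coeff = dot(hidden, d)
    return [h - coeff * di for h, di in zip(hidden, d)]
```

In practice such operations are applied to residual-stream activations at chosen layers, which is what makes layer-dependent findings like the one above measurable.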
When a benchmark's accuracy saturates, it is often retired. This paper shows that this approach misses six other key dimensions: construct validity issues, out-of-distribution generalizability, efficiency, reliability, model versus scaffold importance, and human-agent collaboration uplift. Using CORE-Bench Hard, the authors surface construct validity threats, introduce an improved benchmark v1.1 and an OOD suite, and find that the benchmark remains useful for measuring efficiency, reliability, and performance. A small-scale experiment shows human-agent collaboration yields about a 2x speedup.
Saturated benchmarks can still evaluate efficiency, reliability, and generalizability beyond accuracy.
CORE-Bench Hard has construct validity issues that were hard to anticipate with weaker agents.
Researchers propose a method to detect and control sycophancy in language models using cascading linear features. Their approach uses iterative data generation to isolate features that scale linearly with behavior, enabling better disentanglement. The discovered features form linearly separable subspaces, allowing for detection and steering away from sycophancy, outperforming baseline methods with lower computational cost.
Sycophancy is the tendency of LLMs to prioritize user validation.
Cascading linear features method uses graded samples to isolate features.
This study constructs large-scale algorithm co-occurrence networks in NLP using deep learning on full-text papers. It analyzes network structure and centrality to assess collective influence over decades, finding that classic, high-performing, and cross-period algorithms dominate, and that algorithms with declining influence first lose their core position in the network.
First large-scale algorithm co-occurrence network in NLP built from full-text papers.
Algorithm networks show complex network features with increasing density.
A new method called SGPO improves LLM reasoning by replacing trajectory imitation with strategy distillation, achieving better generalization and outperforming baselines on math benchmarks.
SGPO distills reusable strategies instead of specific solution trajectories.
It uses a token-level forward-KL objective for selective distillation with proximal constraints.
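A token-level forward-KL term can be sketched as below, taking "forward" in the usual distillation sense of KL(teacher || student) over the vocabulary at each position. The `mask` argument is a hypothetical stand-in for whatever selection rule SGPO uses, and the proximal constraints are not modeled.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def forward_kl(teacher_logits, student_logits):
    """KL(teacher || student) for one token position."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def selective_forward_kl(teacher_seq, student_seq, mask):
    """Average forward KL over the positions where mask is truthy."""
    losses = [
        forward_kl(t, s)
        for t, s, keep in zip(teacher_seq, student_seq, mask)
        if keep
    ]
    return sum(losses) / len(losses) if losses else 0.0
```

Forward KL penalizes the student wherever the teacher places mass that the student does not, so restricting it to selected tokens limits how much of the teacher's trajectory is imitated.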
A hybrid model combining ensemble feature selection (ANOVA and mutual information) with logistic regression tuned by Harris Hawks optimization predicts mental health risk in female sex workers (FSWs). It achieves 95.78% accuracy on data from 3,005 FSWs and identifies post-traumatic stress, client violence, and occupational factors as key drivers of depression. The XAI analysis enables early intervention and targeted care.
Recommender systems often induce filter bubbles and semantic homogenization by optimizing solely for immediate engagement. This paper introduces a multi-objective reinforcement learning framework that treats engagement, diversity, and fairness as distinct reward signals using a Pareto-DQN agent. Experiments on MovieLens show that hypervolume-based action selection disrupts feedback loops leading to semantic collapse, achieving societal gains with minimal engagement impact.
Single-objective recommenders cause filter bubbles and semantic collapse.
Pareto-DQN framework optimizes engagement, diversity, and fairness as separate rewards.
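One simplified way to picture multi-objective action selection is to keep only Pareto-optimal actions and then rank them by the hypervolume each one dominates relative to a reference point. The sketch below assumes one Q-vector per action ordered as (engagement, diversity, fairness) and scores each action by the volume of its own dominated box; the paper's actual selection rule may differ.

```python
def dominates(a, b):
    """True if vector a is at least as good as b everywhere and better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(q_vectors):
    """Indices of actions whose Q-vectors are not dominated by any other action."""
    return [
        i
        for i, q in enumerate(q_vectors)
        if not any(dominates(o, q) for j, o in enumerate(q_vectors) if j != i)
    ]


def box_hypervolume(q, ref):
    """Volume of the box between a reference point and a single Q-vector."""
    vol = 1.0
    for x, r in zip(q, ref):
        vol *= max(x - r, 0.0)
    return vol


def select_action(q_vectors, ref):
    """Pick the Pareto-optimal action with the largest dominated volume."""
    front = pareto_front(q_vectors)
    return max(front, key=lambda i: box_hypervolume(q_vectors[i], ref))
```

A product-style score rewards balanced vectors, so an action that is excellent on engagement but near zero on diversity or fairness is not selected over a well-rounded one.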
This paper investigates whether language model (LM) agents can assist in explaining circuit components after they have been localized in mechanistic interpretability. The authors introduce AgenticInterpBench, a benchmark of 84 semi-synthetic transformer circuits with 163 component-level annotations, and HyVE (Hypothesize, Validate, Explain), an agentic explainer that iteratively observes, hypothesizes, and causally validates. Experiments across four LM backbones show that HyVE recovers useful explanations, but no backbone is uniformly best; failures mainly occur in the validation step. A case study on an arithmetic circuit in Llama-3-8B demonstrates extension to naturally trained models. LM agents are promising but reliable validation remains a key obstacle.
LM agents can assist in circuit explanation in mechanistic interpretability.
HyVE agent uses iterative observation, hypothesis generation, and causal validation.
A new study shows that reinforcement learning on beneficial behavior in realistic domains can produce broad and persistent alignment generalization: interventions limited to the health domain improve non-health alignment evaluations and resistance to adversarial attacks.
Constructed a dataset of realistic situations to measure and train beneficial traits across diverse domains.
RL on beneficial behavior improved performance on over 80% of out-of-distribution benchmarks.
Researchers propose a hierarchical multi-agent reinforcement learning framework that enforces hard safety constraints via a constraint manifold at the low level while enabling effective coordination through high-level policy learning. The approach provides theoretical safety guarantees and stationary learning dynamics, and achieves competitive performance with nearly perfect safety rates and strong generalization.
Existing methods face a trade-off between empirical performance and safety guarantees.
The new framework uses a constraint manifold to provide theoretical safety guarantees and stable learning.
This paper explores the nature of AI agents, distinguishing between 'agentic' systems with engineered workflows and 'agentive' systems with endogenous capabilities. It proposes the Goal-Identity-Configurator (GIC) architecture and emphasizes auditability, controllability, and safety of autonomous systems under human oversight.
Draws on Descartes and science fiction to analyze agent architectures along five dimensions: goal, identity, decision-making, self-regulation, and learning.
Distinguishes between 'agentic' systems (competence from engineered workflows) and 'agentive' systems (capabilities arise endogenously), where the latter represents true autonomy.
A new framework extracts rule-grounded reasoning traces from classical planners to supervise driving VLA models, ensuring structurally coupled reasoning and motion generation, with significant performance gains.
Driving VLA models with CoT reasoning often lack causal decision semantics.
Neuro-Symbolic Drive uses internal decision traces from rule-based planners as supervision.
RIFT-Bench is a graph representation-driven methodology for dynamic red-teaming that enables unified security evaluations across diverse agentic AI architectures. It operates in two automated phases—Discovery and Scanning—and supports evaluation of mitigation strategies, demonstrating effectiveness across 45 systems.
RIFT-Bench uses a hierarchical graph representation to unify security evaluation of heterogeneous agentic systems.
The pipeline has two automated phases: structure discovery and adaptive adversarial attack scanning.
This paper proposes a prompt-based uncertainty decomposition method that separates action confidence from request uncertainty, enabling LLM agents to ask for clarification when task specifications are ambiguous. The authors introduce two new benchmarks with 50% underspecified tasks and evaluate against existing methods across five LLMs, showing significant F1 improvements.
Classical aleatoric/epistemic uncertainty frameworks are insufficient for interactive LLM agents; underspecification-aware representations are needed.
A simple prompt-based decomposition separates action confidence from request uncertainty, enabling proactive clarification.
This paper introduces the Integral Transform Network (ITNet), a unified architecture that generalizes convolution, self-attention, and recurrence through a learnable integral kernel. ITNet matches or exceeds specialized models on multiple benchmarks.
Convolution, attention, and recurrence are special cases of a learnable integral transform.
ITNet uses an MLP kernel that depends on positions and features, enabling data-driven adaptation.
A new method enables LLMs to self-align with ethics using a conscience step and Direct Preference Optimization, without external judges. The technique counters emergent misalignment in scenarios like code hacking.
LLMs can self-correct ethical misalignment via a built-in conscience step.
The method uses a frozen copy of the model itself, avoiding external oversight.
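The standard Direct Preference Optimization loss for a single preference pair can be sketched as follows. The inputs are summed log-probabilities of the chosen and rejected responses under the trained policy and under a frozen reference model; treating the reference as the frozen copy mentioned above is an assumption, and `beta=0.1` is only a commonly used default.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one pair: -log(sigmoid(beta * reward margin)).

    The margin compares how much more the policy prefers the chosen
    response over the rejected one than the reference model does.
    """
    margin = beta * (
        (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    )
    # Stable evaluation of -log(sigmoid(margin)) = log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

In a self-alignment setting the chosen and rejected responses would come from the model itself, for example an answer before and after a conscience-style revision, so no external judge is needed to form the pair.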
The paper proposes REVEAL++, a continuous formulation of phenotypic grouping in contrastive learning for vision-language alignment using retinal images and clinical risk narratives to predict Alzheimer's disease risk. It replaces hard group assignments with differentiable weights, enabling graded supervision and end-to-end learning. Evaluated on UK Biobank, it outperforms discrete methods.
REVEAL++ models phenotypic similarity as a continuous differentiable function rather than discrete clusters.
It uses soft multi-positive relationships for contrastive learning, reflecting the spectrum of disease risk.
A study comparing Qwen 2.5 7B and XGBoost on clinical prediction reveals that LLM verbalized confidence is epistemically vacuous, an inverse difficulty effect exists, few-shot and SHAP interventions improve accuracy, and a cross-model calibrator reduces calibration error.
LLM verbalized confidence is nearly constant (0.856-0.937) regardless of accuracy, tracking prompt format.
An inverse difficulty effect: LLM accuracy drops when XGBoost is highly confident, but matches it at moderate uncertainty.
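Calibration error of the kind discussed above is often measured with expected calibration error (ECE), which compares average confidence with accuracy inside confidence bins. The sketch below uses equal-width bins; whether the study uses this exact metric or bin count is an assumption.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE.

    confidences: predicted probabilities in [0, 1].
    correct: 1 if the matching prediction was right, else 0.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, hit))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece
```

A model whose stated confidence sits in a narrow band regardless of correctness, as reported for the LLM here, will show a large gap whenever its accuracy departs from that band.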
DeXposure-Claw is a forecast-grounded agentic supervision system for decentralized finance risk, addressing the over-reading and high-stakes false alarms of general-purpose LLM agents. It uses a graph time-series foundation model to predict exposure networks, deterministic monitors and stress scenarios to generate alerts, and data-health gates to constrain escalation. DeXposure-Bench evaluates system decisions on six axes, including a regulator-aligned false-intervention rate. Experiments on five years of real data validate the system.
DeXposure-Claw improves DeFi risk supervision by routing LLM decisions through structured, forecast-based evidence.
The system uses a graph time-series model to forecast exposure networks and deterministic monitors for alerts and attribution.
This paper proposes a dynamical system model for multi-agent LLM deliberation, where each agent has a hidden internal belief (anchor) that continuously pulls its opinion. The model explains a behavior forbidden by classical consensus rules: an agent's confidence can exceed the convex hull of initial beliefs. Experiments on three open-weight model families show that all anchors have similar influence but differ in location, and only when the anchor is far from initial opinions does deliberation escape the hull.
Each agent in multi-agent LLM deliberation has a hidden anchor (internal belief) that persistently influences its opinion.
The model explains why an agent's confidence in the correct answer can surpass the convex hull of initial beliefs.
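A toy simulation makes the hull argument concrete. With pure averaging, every update is a convex combination of current opinions, so nothing can leave the range of the initial beliefs; adding a pull toward a hidden anchor removes that guarantee. The update rule and the `social` and `pull` coefficients below are illustrative assumptions, not the paper's fitted model.

```python
def deliberate(opinions, anchors, social=0.3, pull=0.2, steps=50):
    """Iterate scalar opinions under peer averaging plus an anchor pull.

    Each round, an agent moves toward the group mean (social term)
    and toward its own hidden anchor (pull term).
    """
    x = list(opinions)
    for _ in range(steps):
        mean = sum(x) / len(x)
        x = [
            xi + social * (mean - xi) + pull * (ai - xi)
            for xi, ai in zip(x, anchors)
        ]
    return x
```

With `pull=0` this reduces to classical consensus and opinions stay within the initial range. With anchors located outside that range the opinions drift past it, while anchors inside the range leave the hull intact, matching the observation that escape requires anchors far from the initial opinions.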
This paper presents a systematic experimental analysis of eight state-of-the-art diffusion language models (DLMs) across eight benchmarks spanning reasoning, coding, translation, knowledge, and structured problem solving, considering both generation quality and computational efficiency. It examines the impact of inference-time factors such as denoising steps, context length, block size, and parallel unmasking strategies, and finds that DLM behavior is strongly influenced by generation-time design choices, leading to distinct trade-offs between performance and efficiency. The study provides practical insights into the capabilities and deployment characteristics of contemporary DLMs.
Evaluated 8 DLMs across 8 benchmarks including reasoning, coding, translation, knowledge, and structured problem solving.
A new study introduces a human-in-the-loop pipeline to measure how well undergraduate computer science programs align with curricular guidelines. Applied to CS2013 and CS2023, it found near-constant coverage (~50%) but a drop in cognitive depth achievement from 95% to 76%, reflecting raised expectations in the newer standard. Persistent gaps in parallel computing, programming languages, and systems fundamentals were identified.
A human-in-the-loop pipeline measures curriculum alignment with CS2013 and CS2023.
Program coverage remained near-constant at about 50% across the decade.
A new paper proposes AgenticRei, a deontic policy framework to govern LLM-driven autonomous agents, addressing obligations, dispensations, and policy conflicts beyond current access-control engines.
Autonomous AI agents pose governance challenges beyond simple permit/prohibit, requiring obligation lifecycle, conflict resolution, and dispensations.
Existing systems like XACML, Rego, and Cedar lack these capabilities; AgenticRei fills the gap using a deontic policy language in OWL.