This paper presents a hierarchical control framework using model predictive control (MPC) and reinforcement learning (RL) for active roll control to manage lateral load transfer during autonomous racing of a wheeled quadruped. The framework integrates offline time-optimal raceline generation, an online MPC planner that actively minimizes the lateral Load Transfer Ratio (LTR), and a low-level, whole-body RL policy deployed directly onto the robot's 16 actuators. Physical experiments show that active roll control reduces mean LTR by up to 44%, improves fastest lap time by 8.7%, and boosts peak lateral acceleration by 21.3% to 1.98 m/s², maintaining robust high-speed stability.
Hierarchical control framework combining MPC and RL to actively manage lateral load transfer
Robot leg actuators act as active suspension, with knee joints generating anti-roll torque
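For readers unfamiliar with the metric, the Load Transfer Ratio is commonly defined in vehicle dynamics as the normalized difference between the normal forces on the two sides of the vehicle. The sketch below uses that common definition. The summary does not state the paper's exact formula or sign convention, and the function and parameter names are illustrative assumptions.

```python
def load_transfer_ratio(fz_left: float, fz_right: float) -> float:
    """Lateral Load Transfer Ratio from left/right total normal forces (N).

    Common vehicle-dynamics definition (an assumption here, not taken from
    the paper): LTR = (Fz_right - Fz_left) / (Fz_right + Fz_left).
    0 means the load is evenly split; +1 or -1 means one side has lifted off.
    """
    total = fz_left + fz_right
    if total <= 0.0:
        raise ValueError("total normal force must be positive")
    return (fz_right - fz_left) / total
```

Under this definition, a controller that minimizes |LTR| keeps the wheels on both sides loaded, which is what preserves lateral grip in a fast corner.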
NavIsaacLab, a framework built on Isaac Lab, enables physics-based and photo-realistic simulations of pedestrians and scenes for benchmarking human-aware robot navigation. It leverages GPU parallel simulation and data-driven pedestrian models (trajectory diffusion + adversarial motion learning) to overcome the scarcity of diverse, high-quality scenario data, providing a robust benchmark for navigation algorithms.
NavIsaacLab uses photo-realistic rendering and GPU parallel simulation to provide real-time 3D visual feedback.
It employs a trajectory diffusion model and an adversarial motion learning controller for realistic, controllable pedestrian motion.
This paper introduces TaskNPoint, a training protocol that enables humanoid robots to learn dynamic skills from a single human demonstration and under an hour of GPU training. By focusing on a critical interaction window, the protocol successfully taught a Unitree G1 humanoid to perform tennis strokes, soccer kicks, and pick-and-place tasks without per-task reward tuning.
TaskNPoint leverages a coach-learner division of labor with minimal human input.
Dynamic skill learning reduces to mastering a short crucial trajectory segment.
RoboTales is a low-cost robotic storytelling system that animates narratives using expressive sock puppetry. Implemented autonomously on a Baxter robot as a test case, RoboTales synchronizes narration, gestures, and mouth movements to perform character-driven stories. In a pilot study, puppet-based storytelling outperformed a gesture-only mode, producing higher HRIES ratings and improved story recall, suggesting that embodied puppetry enhances engagement and narrative comprehension. Designed to be modular and platform-agnostic, RoboTales can be adapted to other manipulators and offers a screen-free alternative to passive media, supporting future deployment in child-centered learning environments.
RoboTales is a low-cost robotic storytelling system using expressive sock puppetry.
Autonomous implementation on a Baxter robot synchronizes narration, gestures, and mouth movements.
OmniContact is a hierarchical framework using contact flow (CF) representation to chain meta-skills for long-horizon humanoid loco-manipulation. Low-level CF-Track learns a unified skill library, while high-level CF-Gen synthesizes future contact flow sequences. Experiments achieve 98.7% on Carry Box and 76.5% on Push-Stack Boxes, outperforming baselines by 40.9% (meta-skill) and 66.5% (chaining). The framework integrates with VLMs for semantic task decomposition.
Proposes OmniContact with contact flow as a compact shared interface between planning and execution
CF-Track learns reusable skills; CF-Gen generates future contact sequences for chaining and recovery
This paper presents the first morphology-specific closed-loop task-space control framework for logarithmic-spiral continuum arms. Using a segmented tendon-driven model and online Jacobian error compensation (a Broyden update and a Kalman filter), it achieves accurate, robust control, outperforms piecewise-constant-curvature methods in simulation, and enables manipulation behaviors such as grasping and obstacle-assisted motion.
First closed-loop control framework tailored to logarithmic-spiral morphology
Combines analytical Jacobian with online error compensation
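The Broyden update named above is a standard rank-one correction that makes a Jacobian estimate agree with the most recently observed motion. A minimal sketch of the textbook update follows. How the paper combines it with the analytical Jacobian and the Kalman filter is not described in this summary, so the function name, the list-of-lists matrix layout, and the eps threshold are assumptions.

```python
def broyden_update(J, dq, dx, eps=1e-12):
    """Textbook Broyden rank-one update of a Jacobian estimate.

    J  : m x n Jacobian estimate as a list of rows
    dq : observed actuator-space step (length n)
    dx : observed task-space displacement (length m)
    Returns J + (dx - J dq) dq^T / (dq^T dq), or a copy of J if dq is ~zero.
    """
    denom = sum(v * v for v in dq)
    if denom < eps:
        # No meaningful motion, so there is nothing to learn from this step.
        return [row[:] for row in J]
    predicted = [sum(a * b for a, b in zip(row, dq)) for row in J]
    residual = [x - p for x, p in zip(dx, predicted)]
    return [
        [a + r * b / denom for a, b in zip(row, dq)]
        for row, r in zip(J, residual)
    ]
```

After the update, the new estimate satisfies the secant condition J_new dq = dx, so the model reproduces the last measured displacement exactly.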
This paper introduces LiMoDE, a two-stage learning scheme using Mixture of Dynamic Experts for lifelong robot manipulation. It first learns prior knowledge via multi-task pre-training with dynamic MoE, then adapts to new tasks with a lifelong MoE mechanism. Experiments show superior performance on simulation and real-world tasks.
LiMoDE uses two stages: multi-task pre-training (dynamic MoE) and task adaptation (lifelong MoE).
Dynamic MoE activates heterogeneous experts based on motion information for short-term manipulations.
This paper proposes RMTL (Reinforced Micro-task Learning), which decomposes long-horizon manipulation tasks into language-described micro-tasks and trains an agent to switch between them. Using multi-view VLM rewards, reverse curriculum, and a hierarchical policy, RMTL provides more informative reward signals than single-prompt VLM rewards, enabling faster learning. Experiments on the Fetch manipulation environment validate its effectiveness.
Single-prompt VLM rewards are flat for much of the trajectory, hindering early progress detection in long-horizon tasks.
RMTL decomposes tasks into micro-tasks, each with its own language prompt, and trains the agent to switch between them.
Researchers developed a physically grounded simulation of a blood capillary network, training deep RL agents to navigate via chemotaxis. They systematically mapped the physical limits of navigation, discovered a forbidden regime, and observed agents independently discovering multiple universal strategies. Without retraining, agents perform targeted blocking and unblocking of capillary flow, restoring throughput to healthy baseline levels.
Developed a physically grounded simulation of blood capillary network with realistic hydrodynamics and RBC dynamics
Deep RL agents successfully navigate via chemotaxis
A new fully unsupervised method, VMTAD, uses a transformer architecture and a memory module to detect obstacles in dynamic agricultural scenes in real time. It achieves state-of-the-art performance on a rapeseed dataset with 0.973 detection AUC and 0.997 segmentation AUC, and a lightweight variant runs in 14 ms.
VMTAD is a fully unsupervised, real-time obstacle detection method for dynamic agricultural scenes.
It uses a memory module to leverage temporal context from video frames, handling motion-induced dynamics.
This paper investigates weight pruning for Vision-Language Models (VLMs) in egocentric visual understanding to achieve low-latency inference while preserving doubly-correct predictions—both accurate and evidence-grounded. Existing pruning methods often maintain evidence localization but degrade accuracy. The authors propose a rationale-informed pruning strategy that aligns evidence with decisions, achieving state-of-the-art accuracy and doubly-correct predictions on egocentric video benchmarks.
Weight pruning reduces VLM latency for on-board processing in human-robot collaboration
Existing pruning preserves evidence localization but harms prediction accuracy
SwarmFly is an open-source MATLAB-based simulation platform for UAV swarms, addressing issues of poor maintenance, steep learning curves, and single-scenario designs. It supports four coordination modes, a plugin architecture, and real-time maps, validated through eight experiments measuring accuracy, wind tolerance, fault recovery, endurance, and airspace compliance.
SwarmFly is a MATLAB platform supporting four swarm coordination modes (leader-follower, decentralized, heterogeneous relay, heterogeneous speed)
Plugin architecture allows researchers to add features without modifying core code
This paper introduces HALO, a visuomotor policy with attention-based memory retrieval for long-horizon robot control, addressing spurious correlations and error accumulation in imitation learning.
HALO distills vision-language model priors to suppress spurious correlations.
HALO uses sparse attention to reduce memory error accumulation in closed-loop control.
The paper extends the Parametric Control Barrier Function (Parametric-CBF) by embedding causality inference to explicitly reason over inter-vehicle influence, enabling an adaptive safety-critical controller that avoids overly conservative behavior and improves task efficiency in multi-vehicle interactions.
Embeds causality inference into Parametric-CBF to handle inter-vehicle influence
This paper proposes RGB, an RL-guided whole-body MPPI framework that uses a pretrained RL policy as a sampling prior and MPPI for online correction, achieving robust and precise humanoid control without retraining. Simulations on a Unitree G1 humanoid demonstrate stable 280 Hz control and improved precision over pure RL.
RGB uses a pretrained RL policy as a sampling prior for MPPI, enabling new objectives without retraining.
MPPI corrects the RL prior online to reduce drift and track whole-body reference signals.
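The idea of using a policy output as an MPPI sampling prior can be illustrated with a single-step sketch: perturb the prior action with Gaussian noise, score each sample, and return the exponentially weighted average. This is a generic MPPI-style update under stated assumptions, not the RGB implementation. The summary does not say how the prior enters the sampler, and a real controller would roll samples out over a horizon. All names and default values are hypothetical.

```python
import math
import random


def mppi_step(prior_action, cost_fn, num_samples=256, sigma=0.2,
              temperature=1.0, rng=None):
    """One MPPI-style correction of a prior action (illustrative only).

    prior_action : list of floats, e.g. the action proposed by an RL policy
    cost_fn      : maps a candidate action (list of floats) to a scalar cost
    Returns the softmin-weighted average of the sampled actions.
    """
    rng = rng or random.Random(0)
    # Sample candidates centred on the prior action (assumed prior usage).
    samples = [[a + rng.gauss(0.0, sigma) for a in prior_action]
               for _ in range(num_samples)]
    costs = [cost_fn(u) for u in samples]
    # Subtract the minimum cost before exponentiating for numerical stability.
    c_min = min(costs)
    weights = [math.exp(-(c - c_min) / temperature) for c in costs]
    total = sum(weights)
    return [sum(w * u[i] for w, u in zip(weights, samples)) / total
            for i in range(len(prior_action))]
```

Because the cost function is supplied at call time, a new objective can be handled by changing cost_fn alone, which is consistent with the claim that no retraining of the prior is needed.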
AeroCast is a probabilistic trajectory prediction framework that combines a Transformer encoder with a Mixture Density Network to predict Gaussian mixture distributions over future 3D displacements. It reduces error by 50% on a quadrotor corpus and runs at 0.1 ms per sample.
Combines a Transformer encoder with a Mixture Density Network for probabilistic 3D trajectory prediction.
Achieves 50% reduction in Average and Final Displacement Error over baselines.
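Average and Final Displacement Error are the usual point metrics for trajectory prediction: the mean Euclidean error over all predicted steps, and the error at the last step. The sketch below computes both for a single predicted 3D trajectory. How AeroCast reduces its Gaussian mixture to one trajectory for evaluation is not stated in this summary, so that step is left out, and the function name is an assumption.

```python
import math


def displacement_errors(pred, truth):
    """Return (ADE, FDE) for one predicted trajectory.

    pred, truth : equal-length, non-empty sequences of (x, y, z) points
    ADE is the mean Euclidean distance over all steps; FDE is the distance
    at the final step.
    """
    if len(pred) != len(truth) or len(pred) == 0:
        raise ValueError("trajectories must be non-empty and equal length")
    dists = [math.dist(p, t) for p, t in zip(pred, truth)]
    return sum(dists) / len(dists), dists[-1]
```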
A novel dataset and framework, SurveilNav, enables robots to collaborate with multi-view surveillance systems for object goal navigation. By integrating active camera scheduling, joint 2D/3D mapping, VLM-based value estimation, and collaborative target verification, it overcomes the limitations of single-robot perception and fixed-camera blind spots. Experiments on HM3D demonstrate state-of-the-art performance in exploration efficiency and navigation success rate.
New dataset with 206 cameras across 74 floors for systematic evaluation of multi-view collaboration
SurveilNav framework integrates active camera scheduling, 2D/3D mapping, VLM-based value estimation, and collaborative verification
Proposes ADM-Fusion, an end-to-end deep learning multi-sensor fusion method that uses an adaptive sensor mixture-of-experts framework with content-aware routing to dynamically weight sensor inputs. It features separate translation and rotation branches coupled via cross-task attention. Trained on the simulated CARLA-LOC dataset and fine-tuned on real-world KITTI data, it demonstrates robust performance under sensor degradation while matching state-of-the-art methods.
Adaptive sensor mixture-of-experts with content-aware routing for real-time dynamic weighting.
Separate translation and rotation branches with cross-task attention for task-specific and shared information.
This paper introduces a novel invariant Kalman filtering approach for extended pose estimation in multi-IMU articulated rigid-body systems. By proposing a relative L-extended pose Lie group representation and incorporating joint kinematic constraints as noise-free pseudo-measurements within an iterated IEKF, the method achieves faster convergence and over 50% reduction in RMSE compared to existing filters on both a UR5e robot and a human leg.
Proposes relative L-extended pose Lie group representation for kinematic-tree systems with one IMU per body
Incorporates joint constraints as noise-free pseudo-measurements in an iterated IEKF, preserving convergence and consistency guarantees
A new method called Latent Sequence Optimization (LSO) enables precise physics-based motion tracking by optimizing over sequences of latents in Behavioral Foundation Models, validated on a real humanoid robot.
Behavioral Foundation Models (BFMs) organize physically plausible behaviors into a latent space but lack time-varying objective support.
NavWM is a unified navigation world model that integrates latent world reasoning, multimodal action prediction, and controllable visual generation. By introducing an anchor-based multimodal trajectory forecasting framework, it generates a diverse action space and uses visual foresight for robust closed-loop planning. Experiments show significant improvements in high-fidelity future state generation and zero-shot navigation success.
NavWM unifies perception, generation, and control within a shared spatiotemporal framework.
Latent world tokens distill geometric and semantic priors for robust structural understanding.
DynaWM improves bipedal-wheeled robot locomotion on continuous stairs by using a world model regularizer for terrain encoding and a momentum target encoder for stable distillation, enabling smoother and more adaptable movement in simulation and real hardware.
DynaWM introduces a world model regularizer to enforce forward-dynamics awareness and preserve terrain geometry.
MinInter selects source demonstrations requiring the least interpolation to generate higher-quality synthetic data for imitation learning. Experiments on 12 manipulation tasks show consistent improvements in data generation and policy success rates, with largest gains on contact-rich, long-horizon tasks.
MinInter minimizes interpolation by selecting the best source demonstration for each initial configuration.
Consistently improves success rates on 12 manipulation tasks from the MimicGen benchmark.
The SPACE framework uses a Cartesian state delta as a universal action representation, with State Prediction and Adaptive Command Execution to address issues in behavior cloning across robots with different dynamics. Experiments show it outperforms direct command prediction and remains robust under dynamics shifts.
Proposes a Cartesian state delta as a universal action representation
SPACE handles variation across embodiments, hardware units, and within a robot
TurboMPC is a differentiable MPC solver that runs entirely on the GPU, supporting state and control inequality constraints, implicit integrators, cross-time-coupled costs, and slack variables. It achieves up to 15× and 58× speedups over state-of-the-art CPU and GPU differentiable solvers, respectively, and scales to planning horizons over 8000 knot points. Deployed on a full-scale car for minimum-time racing, GPU-accelerated Bayesian optimization tuning yields significantly faster driving.
Runs entirely on GPU, combining SQP, ADMM, implicit differentiation, and JAX-CUDA
Up to 15× speedup over CPU solvers and 58× over GPU solvers
This note describes how to integrate sim-to-real performance estimation with betting (from Chen et al.) and safe anytime-valid inference (from Ramdas et al.), using scaled simulators to produce efficient, reliable certificates for mean estimates. The approach is especially valuable in robot performance testing.
Integrates sim-to-real performance estimate with betting and anytime-valid inference
Uses scaled simulators for efficient, reliable mean estimate certificates
Reinforcement learning can train bimanual dexterous hands to play piano in physics simulation with high note accuracy, but for high-DoF hands, relying solely on task rewards or IK inversion often leads to unnatural postures and joint overextension. The proposed Adversarial Posture Regularization (APR) uses a small amount of casual human playing data to match the posture distribution of the policy with a human prior via an adversarial objective, encouraging more human-like hand shapes. The authors collect and release unstructured hand motion data using a consumer-grade Meta Quest 3 and retarget it to the Shadow Hand. APR achieves significantly better performance than prior methods on human-likeness metrics (cPSI, BSE, FAC) and visual quality.
This workshop report captures discussions from the Lorentz Center Workshop "Engineering Reliable Autonomous Systems" (ERAS), held June 10–14, 2024. It focuses on verification and validation, real-world engineering, and safe software architectures for autonomous systems, resulting in a catalogue of challenges and a roadmap to solutions. Some challenges can be addressed by existing academic techniques not yet widely adopted in practice; others require further research.
Co-organized by FMAS and AREA communities, bringing together academia, industry, and domain experts.
Three main topics: verification/validation, engineering real-world systems, and software architectures for safety.
This paper introduces FEARL, a framework that decomposes robot policies into a large controller and a small safety module, enabling formal verification of safety-critical properties while preserving the expressive power of foundation models. Experiments in simulation and on a physical robot demonstrate its effectiveness.
FEARL splits robot policy into a large perception/reasoning controller and a verifiable safety module.
The safety module uses low-dimensional sensor data, making formal verification tractable.