Selected Robotics Papers from arXiv – November 8, 2025

Selected robotics papers from arXiv! 👏👏👏

#Robotics

ForeRobo: Unlocking Infinite Simulation Data for 3D Goal-Driven Robotic Manipulation

Date: November 6, 2025

Authors: Dexin Wang, Faliang Chang, Chunsheng Liu

Link: http://arxiv.org/abs/2511.04381v1

Efficiently utilizing simulation to acquire advanced manipulation skills is both challenging and highly important. We introduce ForeRobo, a generative robotic agent that autonomously acquires manipulation skills driven by vision target states using generated simulations. We advocate for combining generative paradigms with classical control rather than directly learning low-level policies. Our approach equips robotic agents with a self-guided Propose-Generate-Learn-Execute loop. The agent first proposes the skills to acquire and constructs the corresponding simulation environment; then it configures objects to generate target states consistent with the skills (ForeGen). Subsequently, the virtually infinite data generated by ForeGen is used to train the proposed state generation model (ForeFormer), which establishes point-to-point correspondences by predicting the 3D target positions of each point in the current scene state based on the scene state and task instructions. Finally, classical control algorithms drive the robot to execute actions based on vision target states in real-world environments. Compared to end-to-end policy learning methods, ForeFormer provides better interpretability and execution efficiency. We trained and benchmarked ForeFormer on various rigid and articulated object manipulation tasks, observing an average improvement of 56.32% over the state-of-the-art state generation models, demonstrating strong generalization capabilities across different manipulation modes. Furthermore, in real-world evaluations involving over 20 robotic tasks, ForeRobo achieved zero-shot simulation-to-reality transfer and exhibited significant generalization capabilities, with an average success rate of 79.28%.
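To see how point-to-point correspondences can hand a target to a classical controller, here is a minimal sketch. It is not the ForeRobo pipeline: it is a planar (2D) simplification of the 3D case, the function name `fit_rigid_2d` is hypothetical, and it assumes the object moves rigidly. Given current points and their predicted target positions, it fits the least-squares rotation and translation that a controller could then track.

```python
import math

def fit_rigid_2d(src, dst):
    """Least-squares rotation angle and translation mapping src points onto dst.

    src, dst: equal-length lists of (x, y) tuples in one-to-one correspondence.
    Returns (theta, (tx, ty)) such that dst ~= R(theta) @ src + t.
    """
    if len(src) != len(dst) or len(src) < 2:
        raise ValueError("need at least two corresponding point pairs")
    n = len(src)
    # Centroids of both point sets.
    cx = sum(p[0] for p in src) / n
    cy = sum(p[1] for p in src) / n
    dx = sum(q[0] for q in dst) / n
    dy = sum(q[1] for q in dst) / n
    # Accumulate dot and cross products of the centered correspondences.
    s_dot = 0.0
    s_cross = 0.0
    for (px, py), (qx, qy) in zip(src, dst):
        ax, ay = px - cx, py - cy
        bx, by = qx - dx, qy - dy
        s_dot += ax * bx + ay * by
        s_cross += ax * by - ay * bx
    # The optimal rotation angle maximizes cos(theta)*s_dot + sin(theta)*s_cross.
    theta = math.atan2(s_cross, s_dot)
    c, s = math.cos(theta), math.sin(theta)
    # Translation aligns the rotated source centroid with the target centroid.
    tx = dx - (c * cx - s * cy)
    ty = dy - (s * cx + c * cy)
    return theta, (tx, ty)
```

In 3D the same idea is usually solved with an SVD-based alignment; the planar form is used here only to keep the example dependency-free.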

GentleHumanoid: Learning Upper Limb Compliance for Rich Human-Robot and Object Interactions

Date: November 6, 2025

Authors: Qingzhou Lu, Yao Feng, Baiyu Shi, Michael Piseno, Zhenan Bao, C. Karen Liu

Link: http://arxiv.org/abs/2511.04679v1

Humanoid robots are expected to operate in human-centered domains where safe and natural physical interactions are crucial. However, most recent reinforcement learning (RL) policies emphasize rigid tracking and suppress external forces. Existing impedance-enhanced methods are often limited to base or end-effector control and focus on resisting extreme forces rather than achieving compliance. We introduce the GentleHumanoid framework, which integrates impedance control into a whole-body motion tracking policy to achieve upper limb compliance. At its core is a unified spring-based formulation that simulates resistive contact (producing restorative forces when pressing against surfaces) and guiding contact (sampled from human motion data). This formulation ensures dynamic consistency of forces at the shoulder, elbow, and wrist while allowing the policy to engage in diverse interaction scenarios. Task-adjustable force thresholds further support safety. We evaluated multiple tasks requiring varying levels of compliance in simulation and on the Unitree G1 humanoid robot, including gentle hugging, sit-to-stand assistance, and safe object manipulation. Our policy consistently reduced peak contact forces while maintaining task success, resulting in smoother and more natural interactions compared to the baseline. These results highlight a step towards humanoid robots that can safely and effectively collaborate with humans and handle objects in real-world environments.
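The idea of a spring-based restoring force capped by a task-adjustable threshold can be pictured with a minimal sketch. This is not the GentleHumanoid implementation: the function name `compliant_force` and the stiffness and threshold values are hypothetical, and the paper's formulation covers several joints and contact types that are not modeled here.

```python
import math

def compliant_force(pos, target, stiffness=50.0, max_force=10.0):
    """Spring force pulling `pos` toward `target`, capped at `max_force`.

    pos, target: 3D positions in meters (lists of floats).
    stiffness:   hypothetical spring constant in N/m.
    max_force:   hypothetical task-specific safety threshold in N.
    """
    # Linear spring: force is proportional to the displacement from the target.
    raw = [stiffness * (t - p) for p, t in zip(pos, target)]
    norm = math.sqrt(sum(f * f for f in raw))
    if norm <= max_force:
        return raw
    # Rescale so the direction is kept but the magnitude equals the threshold.
    scale = max_force / norm
    return [f * scale for f in raw]
```

Lowering `max_force` for a task such as hugging, and raising it for a task such as sit-to-stand assistance, is one way to read "task-adjustable" in this toy setting.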

Soft Interaction Real-to-Sim Robot Policy Evaluation Based on Gaussian Splatting Simulation

Date: November 6, 2025

Authors: Kaifeng Zhang, Shuo Sha, Hanxiao Jiang, Matthew Loper, Hyunjong Song, Guangyan Cai, Zhuo Xu, Xiaochen Hu, Changxi Zheng, Yunzhu Li

Link: http://arxiv.org/abs/2511.04665v1

Robotic manipulation policies are rapidly evolving, but direct evaluation in the real world remains costly, time-consuming, and difficult to reproduce, especially for tasks involving deformable objects. Simulation offers a scalable and systematic alternative, but existing simulators often fail to capture the coupled visual and physical complexities of soft interactions. We propose a framework for real-to-sim policy evaluation that constructs soft digital twins from real-world videos and uses 3D Gaussian Splatting for realistic rendering of robots, objects, and environments. We validated our approach on representative deformable manipulation tasks, including plush toy packaging, rope routing, and T-block pushing, demonstrating that simulated rollouts are highly correlated with actual execution performance and revealing key behavioral patterns of learned policies. Our results indicate that combining physics-informed reconstruction with high-quality rendering enables reproducible, scalable, and accurate evaluation of robotic manipulation policies. Website: https://real2sim-eval.github.io/.
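One common way to quantify whether simulated rollouts track real execution is a correlation coefficient over paired per-policy success rates. The sketch below computes the Pearson coefficient; the choice of this particular metric and the function name `pearson` are assumptions for illustration, not details taken from the abstract.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least two values")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    # Covariance and variances, left unnormalized because the factors cancel.
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)
```

Calling `pearson(sim_rates, real_rates)` with one entry per evaluated policy gives a value near 1 when the simulator ranks and spaces policies the way the real robot does.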

X-Diffusion: Training Diffusion Policies on Cross-Embodiment Human Demonstrations

Date: November 6, 2025

Authors: Maximus A. Pace, Prithwish Dan, Chuanruo Ning, Atiksh Bhardwaj, Audrey Du, Edward W. Duan, Wei-Chiu Ma, Kushal Kedia

Link: http://arxiv.org/abs/2511.04671v1

Human videos can be recorded quickly and at scale, making them an attractive source of training data for robotic learning. However, fundamental differences in embodiment between humans and robots lead to mismatches in action execution. As a result, mapping human hand motion directly onto the robot through kinematic retargeting can produce actions that are physically infeasible for the robot. Despite these low-level discrepancies, human demonstrations provide valuable motion cues on how to manipulate and interact with objects. Our key idea is to leverage the forward diffusion process: as noise is added to the actions, low-level execution discrepancies gradually disappear while high-level task guidance is preserved. We propose X-Diffusion, a principled framework for training diffusion policies that maximizes the utility of human data without learning dynamically infeasible motions. X-Diffusion first trains a classifier to predict whether a noisy action is executed by a human or a robot. Then, a human action is incorporated into policy training only after enough noise has been added that the classifier cannot tell which embodiment produced it. Actions consistent with robot execution supervise fine denoising at low noise levels, while mismatched human actions provide coarse guidance at higher noise levels. Our experiments show that naive co-training under execution mismatches reduces policy performance, while X-Diffusion consistently improves it. Across five manipulation tasks, X-Diffusion's average success rate exceeds the best baseline by 16%. The project website can be found at https://portal-cornell.github.io/X-Diffusion/.
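The gating idea can be sketched as follows. This is a toy reading of the abstract, not the X-Diffusion code: the function names, the `margin` parameter, the single-sample test, and the `classifier(noisy_action, t)` interface (returning a probability that the action came from a human) are all assumptions. Robot actions supervise every noise step, while a human action supervises only the steps from the first one at which the classifier's output is close to chance.

```python
import math
import random

def add_noise(action, alpha_bar, rng):
    """Forward diffusion: scale the clean action and mix in Gaussian noise."""
    keep = math.sqrt(alpha_bar)
    noise = math.sqrt(1.0 - alpha_bar)
    return [keep * a + noise * rng.gauss(0.0, 1.0) for a in action]

def min_indistinguishable_step(action, alpha_bars, classifier, rng, margin=0.1):
    """Smallest step t where classifier(noisy, t) is within `margin` of 0.5.

    Returns None if the classifier stays confident at every step.
    """
    for t, alpha_bar in enumerate(alpha_bars):
        noisy = add_noise(action, alpha_bar, rng)
        if abs(classifier(noisy, t) - 0.5) <= margin:
            return t
    return None

def training_steps(is_human, action, alpha_bars, classifier, rng, margin=0.1):
    """Noise steps at which this action is allowed to supervise the policy."""
    all_steps = list(range(len(alpha_bars)))
    if not is_human:
        return all_steps  # robot data is used at every noise level
    t_min = min_indistinguishable_step(action, alpha_bars, classifier, rng, margin)
    return [] if t_min is None else all_steps[t_min:]
```

Here `alpha_bars` is a decreasing schedule in [0, 1], so larger step indices mean noisier actions.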

Regret Lower Bound for Distributed Multi-Agent Stochastic Shortest Path Problem

Date: November 6, 2025

Authors: Utkarsh U. Chavan, Prashant Trivedi, Nandyala Hemachandra

Link: http://arxiv.org/abs/2511.04594v1

Multi-agent systems (MAS) are at the core of applications such as swarm robotics and traffic routing, where agents must coordinate in a decentralized manner to achieve common goals. The stochastic shortest path (SSP) problem provides a natural framework for decentralized control in such settings. While the SSP learning problem in single-agent settings has been extensively studied, the decentralized multi-agent variant remains underexplored. In this work, we take a step towards filling this gap. We investigate decentralized multi-agent SSPs (Dec-MASSPs) under linear function approximation, where state transition dynamics and costs are represented by linear models. By applying a novel argument based on symmetry, we identify the structure of optimal policies. Our main contribution is the first regret lower bound for this setting, obtained by constructing hard-to-learn instances for any number of agents n. Over K rounds, our regret lower bound is Ω(√K), highlighting the inherent learning difficulty in Dec-MASSPs. These insights illuminate the learning complexity of decentralized control and can further guide the design of efficient learning algorithms in multi-agent systems.
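For readers unfamiliar with the notation, the Ω(√K) statement can be unpacked using the regret definition that is standard in the SSP learning literature. The symbols below are assumptions, since the abstract does not fix any notation: H_k is the length of round k, c is the cost function, V* is the optimal expected cost from the initial state s_init, and C > 0 is a constant that does not depend on K. Dependence on other problem parameters, such as the number of agents n, is omitted because the abstract does not state it.

```latex
R_K \;=\; \sum_{k=1}^{K} \sum_{h=1}^{H_k} c\bigl(s_h^k, a_h^k\bigr) \;-\; K \, V^{*}(s_{\mathrm{init}}),
\qquad
\mathbb{E}[R_K] \;\ge\; C \sqrt{K} \quad \text{for all sufficiently large } K .
```

A lower bound of this form says that for every learning algorithm there is some instance on which the inequality holds, so no algorithm can guarantee regret growing slower than √K across all instances.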

Evo-1: A Lightweight Visual-Language-Action Model with Semantic Alignment Preservation

Date: November 6, 2025

Authors: Tao Lin, Yilei Zhong, Yuxin Du, Jingjing Zhang, Jiting Liu, Yinxinyu Chen, Encheng Gu, Ziyan Liu, Hongyi Cai, Yanwen Zou, Lixing Zou, Zhaoye Zhou, Gen Li, Bo Zhao

Link: http://arxiv.org/abs/2511.04555v1

Visual-Language-Action (VLA) models have become a powerful framework that unifies perception, language, and control, enabling robots to execute various tasks through multimodal understanding. However, current VLA models often contain a large number of parameters and heavily rely on pre-training on large-scale robotic data, leading to high computational costs during training and limited deployment capabilities for real-time inference. Moreover, most training paradigms tend to degrade the perceptual representations of the visual-language backbone, resulting in overfitting and poor generalization to downstream tasks. In this work, we propose Evo-1, a lightweight VLA model that reduces computation and improves deployment efficiency while maintaining strong performance without pre-training on robotic data. Evo-1 is based on a native multimodal visual-language model (VLM), combined with a novel cross-modal diffusion transformer and an optimized integration module, forming an effective architecture. We further introduce a two-stage training paradigm that gradually aligns actions with perception, preserving the representations of the VLM. Notably, Evo-1 has only 770 million parameters, achieving state-of-the-art results on the Meta-World and RoboTwin suites, surpassing the previous best models by 12.4% and 6.9%, respectively, and achieving competitive results of 94.8% on LIBERO. In real-world evaluations, Evo-1 achieved a success rate of 78% with high inference frequency and low memory overhead, outperforming all baseline methods. We release the code, data, and model weights to facilitate future research on lightweight and efficient VLA models.

Temporal Action Selection for Action Chunking

Date: November 6, 2025

Authors: Yueyang Weng, Xiaopeng Zhang, Yongjin Mu, Yingcong Zhu, Yanjie Li, Qi Liu

Link: http://arxiv.org/abs/2511.04421v1

Action chunking is a widely adopted approach in Learning from Demonstration (LfD). By modeling multi-step action chunks instead of single-step actions, action chunking significantly enhances the capacity to model human expert policies. However, the reduced decision frequency limits the use of recent observations and decreases reactivity, which is especially evident under sensor noise and dynamic environmental changes. Existing efforts primarily trade off reactivity against decision consistency without achieving both simultaneously. To address this limitation, we propose a new algorithm, the Temporal Action Selector (TAS), which caches predicted action chunks from multiple time steps and dynamically selects the best action through a lightweight selector network. TAS achieves balanced optimization across three key dimensions: reactivity, decision consistency, and motion coherence. Experiments on multiple tasks with different base policies demonstrate that TAS significantly improves success rates, with absolute gains of up to 73.3%. Furthermore, integrating TAS with base policies and residual reinforcement learning (RL) greatly enhances training efficiency and performance. Experiments on both simulated and physical robots confirm the effectiveness of this approach.
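The caching-and-selection mechanism can be pictured with a minimal sketch. It is not the TAS implementation: the class `ChunkCache`, the function `select_action`, and the `score_fn` callback (which stands in for the learned selector network) are hypothetical names. Each cached chunk was predicted at a different time step, so at the current step every chunk that still covers it offers one candidate action.

```python
from collections import deque

class ChunkCache:
    """Keeps the most recent action chunks, each tagged with its start step."""

    def __init__(self, chunk_len, max_chunks):
        self.chunk_len = chunk_len
        self.chunks = deque(maxlen=max_chunks)  # oldest chunks are evicted

    def add(self, start_step, actions):
        if len(actions) != self.chunk_len:
            raise ValueError("chunk has the wrong length")
        self.chunks.append((start_step, list(actions)))

    def candidates(self, step):
        """All (start_step, action) pairs whose chunk covers `step`."""
        out = []
        for start, actions in self.chunks:
            offset = step - start
            if 0 <= offset < self.chunk_len:
                out.append((start, actions[offset]))
        return out

def select_action(cache, step, score_fn):
    """Pick the candidate with the highest score for the current step.

    score_fn(start_step, action, step) is a placeholder for a selector network.
    """
    cands = cache.candidates(step)
    if not cands:
        raise ValueError("no cached chunk covers this step")
    best = max(cands, key=lambda c: score_fn(c[0], c[1], step))
    return best[1]
```

A scorer that always prefers the newest chunk recovers a fully reactive policy, and one that always prefers the oldest recovers plain open-loop chunk execution; a learned selector sits between these extremes.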

Near-Global Convergence Design for Landmark-Inertial SLAM Based on Synchronized Observers

Date: November 6, 2025

Authors: Arkadeep Saha, Pieter van Goor, Antonio Franchi, Ravi Banavar

Link: http://arxiv.org/abs/2511.04531v1

Landmark Inertial Simultaneous Localization and Mapping (LI-SLAM) is the problem of estimating the positions of landmarks in the environment and the robot's position relative to these landmarks using landmark position measurements and Inertial Measurement Unit (IMU) measurements. This paper proposes a continuous-time nonlinear observer for LI-SLAM and analyzes the observer in a base space containing all observable states. We prove local exponential stability and almost global asymptotic stability of the error dynamics in the base space, and we verify these results in simulation.

BoRe-Depth: Boundary-Optimized Self-Supervised Monocular Depth Estimation for Embedded Systems

Date: November 6, 2025

Authors: Chang Liu, Juan Li, Sheng Zhang, Chang Liu, Jie Li, Xu Zhang

Link: http://arxiv.org/abs/2511.04388v1

Depth estimation is one of the key technologies for achieving 3D perception in drones. Due to its cost-effectiveness, monocular depth estimation has been widely studied, but existing methods face challenges in depth estimation performance and object boundary clarity on embedded systems. In this paper, we propose a novel monocular depth estimation model, BoRe-Depth, which contains only 8.7M parameters. It accurately estimates depth maps on embedded systems and significantly improves boundary quality. First, we design an Enhanced Feature Adaptive Fusion (EFAF) module that adaptively fuses depth features to enhance boundary detail representation. Second, we integrate semantic knowledge into the encoder to improve object recognition and boundary perception capabilities. Finally, BoRe-Depth is deployed on the NVIDIA Jetson Orin, where it runs at 50.7 FPS. We demonstrate that the proposed model significantly outperforms previous lightweight models on multiple challenging datasets and provide a detailed ablation study for the proposed method. The code can be found at https://github.com/liangxiansheng093/BoRe-Depth.

GraSP-VLA: Graph-Based Structured Symbolic Action Representation for Long-Horizon Planning with VLA Policies

Date: November 6, 2025

Authors: Maëlic Neau, Zoe Falomir, Paulo E. Santos, Anne-Gwenn Bosser, Cédric Buche

Link: http://arxiv.org/abs/2511.04357v1

Deploying autonomous robots that can learn new skills from demonstrations is a significant challenge in modern robotics. Existing solutions typically adopt end-to-end visual-language-action (VLA) imitation learning models or action model learning (AML) in symbolic approaches. On one hand, current VLA models are limited by the lack of high-level symbolic planning, hindering their capabilities in long-horizon tasks. On the other hand, symbolic AML methods are limited in generalization and scalability. In this paper, we propose a new neural-symbolic method, GraSP-VLA, a framework that uses continuous scene graph representations to generate symbolic representations of human demonstrations. This representation is used to generate new planning domains during inference and serves as a coordinator for low-level VLA policies, increasing the number of actions that can be reproduced in sequence. Our results indicate that GraSP-VLA is effective in automatically generating planning domains from observations using symbolic representations. Furthermore, results from real-world experiments suggest that our continuous scene graph representation has the potential to coordinate low-level VLA policies in long-horizon tasks.
