Li Luchen, Thuruthel Thomas George
Department of Computer Science, University College London, London, United Kingdom.
Front Robot AI. 2024 Sep 30;11:1407519. doi: 10.3389/frobt.2024.1407519. eCollection 2024.
Predicting the consequences of the agent's actions on its environment is a pivotal challenge in robotic learning, and it plays a key role in developing higher cognitive skills for intelligent robots. While current methods have predominantly relied on vision and motion data to generate predicted videos, more comprehensive sensory perception is required for complex physical interactions such as contact-rich manipulation or highly dynamic tasks. In this work, we investigate the interdependence between vision and tactile sensation in the setting of dynamic robotic interaction. A multi-modal fusion mechanism is introduced into the action-conditioned video prediction model to forecast future scenes, enriching the single-modality prototype with a compressed latent representation of multiple sensory inputs. Additionally, to realize the interactive setting, we built a robotic interaction system equipped with both web cameras and vision-based tactile sensors to collect a dataset of vision-tactile sequences and the corresponding robot action data. Finally, through a series of qualitative and quantitative comparative studies of different prediction architectures and tasks, we present an insightful analysis of the cross-modal influence between vision, touch, and action, revealing the asymmetrical impact that the sensory modalities have on one another when contributing to the interpretation of environmental information. This opens possibilities for more adaptive and efficient robotic control in complex environments, with implications for dexterous manipulation and human-robot interaction.
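To make the described architecture concrete, the following is a minimal, illustrative sketch (not the authors' implementation) of how vision and tactile observations might be encoded and fused into a compressed latent that, together with the robot action, conditions a prediction of the next state. The class names, layer sizes, 64x64 input resolution, and 7-dimensional action are assumptions chosen only for illustration.

```python
# Hypothetical sketch of multi-modal fusion for action-conditioned prediction.
# Not the authors' code; all dimensions and module names are illustrative.
import torch
import torch.nn as nn


class MultiModalFusion(nn.Module):
    """Encode vision and tactile frames, then fuse them into a compact latent."""

    def __init__(self, latent_dim: int = 128):
        super().__init__()

        def make_encoder() -> nn.Sequential:
            # Small convolutional encoder shared in structure by both modalities.
            return nn.Sequential(
                nn.Conv2d(3, 32, kernel_size=4, stride=2, padding=1),   # 64 -> 32
                nn.ReLU(),
                nn.Conv2d(32, 64, kernel_size=4, stride=2, padding=1),  # 32 -> 16
                nn.ReLU(),
                nn.Flatten(),
                nn.Linear(64 * 16 * 16, latent_dim),
            )

        self.vision_encoder = make_encoder()
        self.tactile_encoder = make_encoder()
        # Compress the concatenated modality features into a single fused latent.
        self.fusion = nn.Sequential(
            nn.Linear(2 * latent_dim, latent_dim),
            nn.ReLU(),
            nn.Linear(latent_dim, latent_dim),
        )

    def forward(self, vision: torch.Tensor, tactile: torch.Tensor) -> torch.Tensor:
        z_v = self.vision_encoder(vision)
        z_t = self.tactile_encoder(tactile)
        return self.fusion(torch.cat([z_v, z_t], dim=-1))


class ActionConditionedPredictor(nn.Module):
    """Predict the next latent state from the fused latent and the robot action."""

    def __init__(self, latent_dim: int = 128, action_dim: int = 7):
        super().__init__()
        self.dynamics = nn.Sequential(
            nn.Linear(latent_dim + action_dim, 256),
            nn.ReLU(),
            nn.Linear(256, latent_dim),
        )

    def forward(self, latent: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return self.dynamics(torch.cat([latent, action], dim=-1))


if __name__ == "__main__":
    fusion = MultiModalFusion()
    predictor = ActionConditionedPredictor()
    vision = torch.randn(4, 3, 64, 64)   # RGB camera frames
    tactile = torch.randn(4, 3, 64, 64)  # vision-based tactile sensor images
    action = torch.randn(4, 7)           # e.g. end-effector command
    z_next = predictor(fusion(vision, tactile), action)
    print(z_next.shape)  # torch.Size([4, 128])
```

In a full video prediction pipeline, the predicted latent would be decoded back into future image frames; the sketch stops at the latent dynamics step to keep the fusion idea in focus.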