Department of Computer Science and Technology, Beijing National Research Centre for Information Science and Technology, Tsinghua University, Beijing 100084, China.
Neural Netw. 2024 Aug;176:106347. doi: 10.1016/j.neunet.2024.106347. Epub 2024 Apr 27.
Reinforcement learning has achieved promising results on robotic control tasks but struggles to effectively leverage information from multiple sensory modalities that differ in many characteristics. Recent works construct auxiliary losses based on reconstruction or mutual information to extract joint representations from multiple sensory inputs, improving the sample efficiency and performance of reinforcement learning algorithms. However, the representations learned by these methods can capture information irrelevant to learning a policy and may degrade performance. We argue that it is helpful to compress the information that the learned joint representation retains about the raw multimodal observations, and we propose a multimodal information bottleneck model to learn task-relevant joint representations from egocentric images and proprioception. Our model compresses and retains the predictive information in multimodal observations to learn a compressed joint representation that fuses complementary information from visual and proprioceptive feedback while filtering out task-irrelevant information in the raw multimodal observations. For computationally tractable optimization, we minimize an upper bound of our multimodal information bottleneck objective. Experimental evaluations on several challenging locomotion tasks with egocentric images and proprioception show that our method achieves better sample efficiency and zero-shot robustness to unseen white noise than leading baselines. We also empirically demonstrate that leveraging information from both egocentric images and proprioception is more helpful for learning policies on locomotion tasks than using either modality alone.
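The abstract states the objective only informally; as an illustrative sketch (not the paper's exact formulation), a generic information-bottleneck objective matching the description would retain predictive information in the joint representation z_t while penalizing information kept about the raw multimodal observation o_t. The notation below (z_t, o_t, the trade-off weight \beta, the encoder q_\phi, and the prior r) is our own, chosen for illustration:

\max_{\phi}\; I(z_t;\, z_{t+1}) \;-\; \beta\, I(z_t;\, o_t), \qquad o_t = \big(o_t^{\mathrm{img}},\, o_t^{\mathrm{prop}}\big)

Both mutual-information terms are intractable in general, which is why the paper minimizes an upper bound instead. A standard variational upper bound on the compression term (as in variational information bottleneck methods) replaces the marginal over z_t with a fixed prior r(z_t), e.g. a standard Gaussian:

I(z_t;\, o_t) \;\le\; \mathbb{E}_{o_t}\!\left[ D_{\mathrm{KL}}\!\big( q_\phi(z_t \mid o_t) \,\|\, r(z_t) \big) \right],

which holds because dropping the nonnegative term D_{\mathrm{KL}}(p(z_t) \,\|\, r(z_t)) can only increase the bound. Whether the paper's bound takes exactly this form is an assumption on our part.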