Bai Chenjia, Liu Peng, Liu Kaiyu, Wang Lingxiao, Zhao Yingnan, Han Lei, Wang Zhaoran
IEEE Trans Neural Netw Learn Syst. 2023 Aug;34(8):4776-4790. doi: 10.1109/TNNLS.2021.3129160. Epub 2023 Aug 4.
Efficient exploration remains a challenging problem in reinforcement learning, especially for tasks where extrinsic rewards from the environment are sparse or even entirely ignored. Significant advances based on intrinsic motivation show promising results in simple environments but often get stuck in environments with multimodal and stochastic dynamics. In this work, we propose a variational dynamic model based on conditional variational inference to model this multimodality and stochasticity. We treat the environmental state-action transition as a conditional generative process that generates the next-state prediction conditioned on the current state, the action, and a latent variable, which provides a better understanding of the dynamics and leads to better exploration performance. We derive an upper bound on the negative log-likelihood of the environmental transition and use this upper bound as the intrinsic reward for exploration, which allows the agent to learn skills by self-supervised exploration without observing extrinsic rewards. We evaluate the proposed method on several image-based simulation tasks and a real robotic manipulation task. Our method outperforms several state-of-the-art environment-model-based exploration approaches.
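To make the intrinsic-reward construction concrete, the following is a minimal sketch (not the authors' implementation) of using the negative ELBO of a conditional variational model p(s'|s, a) as an exploration bonus. The encoder and decoder are stood in for by hypothetical callables; in practice they would be neural networks, and all shapes and the random linear maps in the usage example are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def kl_to_standard_normal(mu, log_var):
    # KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dims.
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

def gaussian_nll(x, mean, log_var):
    # Negative log-likelihood of x under a diagonal Gaussian.
    return 0.5 * np.sum(log_var + (x - mean) ** 2 / np.exp(log_var)
                        + np.log(2.0 * np.pi))

def intrinsic_reward(s, a, s_next, encode, decode):
    """Negative ELBO of the transition (s, a) -> s_next, used as the bonus.

    `encode` plays the role of the posterior q(z | s, a, s'), and `decode`
    the role of the generator p(s' | s, a, z); both are hypothetical here.
    The returned value upper-bounds -log p(s' | s, a).
    """
    mu, log_var = encode(s, a, s_next)
    eps = rng.standard_normal(mu.shape)
    z = mu + np.exp(0.5 * log_var) * eps        # reparameterization trick
    mean_next, dec_log_var = decode(s, a, z)
    recon_nll = gaussian_nll(s_next, mean_next, dec_log_var)
    kl = kl_to_standard_normal(mu, log_var)
    return recon_nll + kl

# Toy usage: random linear maps stand in for the learned networks.
S, A, Z = 4, 2, 3
We = rng.standard_normal((2 * S + A, 2 * Z))
Wd = rng.standard_normal((S + A + Z, 2 * S))
encode = lambda s, a, sn: np.split(np.concatenate([s, a, sn]) @ We, 2)
decode = lambda s, a, z: np.split(np.concatenate([s, a, z]) @ Wd, 2)

s, a, s_next = rng.standard_normal(S), rng.standard_normal(A), rng.standard_normal(S)
r_int = intrinsic_reward(s, a, s_next, encode, decode)
```

Transitions that the model reconstructs poorly (high negative ELBO) receive large bonuses, steering the agent toward parts of the dynamics it has not yet captured, while the latent variable z lets the model represent multimodal next-state distributions.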