

Policy Gradient From Demonstration and Curiosity.

Author Information

Chen Jie, Xu Wenjun

Publication Information

IEEE Trans Cybern. 2023 Aug;53(8):4923-4933. doi: 10.1109/TCYB.2022.3150802. Epub 2023 Jul 18.

Abstract

With reinforcement learning, an agent can learn complex behaviors from high-level abstractions of the task. However, exploration and reward shaping remain challenging for existing methods, especially in scenarios where extrinsic feedback is sparse. Expert demonstrations have been investigated as a way around these difficulties, but a large number of high-quality demonstrations is usually required. In this work, an integrated policy gradient algorithm is proposed that boosts exploration and facilitates intrinsic reward learning from only a limited number of demonstrations. We achieve this by reformulating the original reward function with two additional terms: the first measures the Jensen-Shannon divergence between the current policy and the expert's demonstrations, and the second estimates the agent's uncertainty about the environment. The algorithm is evaluated on a range of simulated tasks with sparse extrinsic reward signals, where only a limited number of demonstrated trajectories are provided for each task. It achieves superior exploration efficiency and high average return in all tasks. Furthermore, the agent is found to imitate the expert's behavior while sustaining a high return.
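To make the reward reformulation concrete, a minimal sketch in LaTeX follows; the mixing weights \lambda_1 and \lambda_2 and the term names r_{\mathrm{JS}} and r_{\mathrm{cur}} are assumed notation for illustration, not symbols taken from the paper:

% A hedged sketch of the reshaped reward described in the abstract,
% not the paper's exact formulation.
% r_e: sparse extrinsic reward from the environment
% r_{\mathrm{JS}}: demonstration term based on the Jensen-Shannon
%   divergence between the current policy and the expert's demonstrations
% r_{\mathrm{cur}}: curiosity term estimating the agent's uncertainty
%   about the environment
\[
  r'(s_t, a_t) = r_e(s_t, a_t)
               + \lambda_1 \, r_{\mathrm{JS}}(s_t, a_t)
               + \lambda_2 \, r_{\mathrm{cur}}(s_t, a_t)
\]

Under such a decomposition, maximizing the expected return pushes the policy both toward the expert's demonstrated behavior (via the Jensen-Shannon term) and toward under-explored, high-uncertainty regions of the environment (via the curiosity term), which matches the imitation and exploration effects the abstract reports.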

