Feature Control as Intrinsic Motivation for Hierarchical Reinforcement Learning

Publication Information

IEEE Trans Neural Netw Learn Syst. 2019 Nov;30(11):3409-3418. doi: 10.1109/TNNLS.2019.2891792. Epub 2019 Jan 29.

Abstract

One of the main concerns of deep reinforcement learning (DRL) is data inefficiency, which stems both from an inability to fully utilize acquired data and from naive exploration strategies. To alleviate these problems, we propose a DRL algorithm that aims to improve data efficiency through both the utilization of unrewarded experiences and a better exploration strategy, combining ideas from unsupervised auxiliary tasks, intrinsic motivation, and hierarchical reinforcement learning (HRL). Our method is based on a simple HRL architecture with a metacontroller and a subcontroller. The subcontroller is intrinsically motivated by the metacontroller to learn to control aspects of the environment, with the intention of giving the agent: 1) a neural representation that is generically useful for tasks involving manipulation of the environment and 2) the ability to explore the environment in a temporally extended manner through the control of the metacontroller. In this way, we reinterpret the notion of pixel- and feature-control auxiliary tasks as reusable skills that can be learned via an intrinsic reward. We evaluate our method on a number of Atari 2600 games. We found that it outperforms the baseline in several environments and significantly improves performance in one of the hardest games, Montezuma's Revenge, for which the ability to utilize sparse data is key. We found that the inclusion of the intrinsic reward is crucial to the improvement in performance and that most of the benefit seems to derive from the representations learned during training.
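To make the described architecture concrete, below is a minimal, runnable sketch of the metacontroller/subcontroller loop: the metacontroller periodically selects a feature of the environment for the subcontroller to control, and the subcontroller receives an intrinsic reward for the change it induces in that feature, alongside the sparse extrinsic reward. This is an illustration under stated assumptions, not the paper's implementation; all names here (ToyEnv, META_INTERVAL, BETA, the random policies) are hypothetical.

```python
# Hypothetical sketch of the meta/sub-controller loop from the abstract.
# ToyEnv, META_INTERVAL, BETA, and the policies are illustrative assumptions.

import random

NUM_FEATURES = 4    # size of the feature vector (assumption)
META_INTERVAL = 10  # steps between metacontroller decisions (assumption)
BETA = 0.1          # weight of the intrinsic reward (assumption)


class ToyEnv:
    """Toy stand-in environment: the observation *is* the feature vector,
    and each action nudges one feature upward."""

    def reset(self):
        self.state = [0.0] * NUM_FEATURES
        self.t = 0
        return list(self.state)

    def step(self, action):
        self.state[action] += random.uniform(0.0, 1.0)
        self.t += 1
        # Sparse extrinsic reward, mimicking games like Montezuma's Revenge.
        ext_reward = 1.0 if sum(self.state) > 20.0 else 0.0
        return list(self.state), ext_reward, self.t >= 100


def meta_policy(feats):
    """Metacontroller: choose which feature the subcontroller should control.
    Random here; in the paper this choice is learned from extrinsic reward."""
    return random.randrange(NUM_FEATURES)


def sub_policy(feats, goal):
    """Subcontroller: act so as to change the selected feature.
    In this toy environment, action index == feature index."""
    return goal


def intrinsic_reward(prev_feats, feats, goal):
    """Reward the subcontroller for the change induced in the goal feature."""
    return abs(feats[goal] - prev_feats[goal])


env = ToyEnv()
feats = env.reset()
goal, ret, done, t = 0, 0.0, False, 0
while not done:
    if t % META_INTERVAL == 0:
        goal = meta_policy(feats)  # temporally extended exploration
    action = sub_policy(feats, goal)
    obs, ext_r, done = env.step(action)
    prev_feats, feats = feats, obs
    # The subcontroller would be trained on extrinsic + intrinsic reward,
    # while the metacontroller is trained on extrinsic reward alone.
    ret += ext_r + BETA * intrinsic_reward(prev_feats, feats, goal)
    t += 1
print(f"subcontroller return (extrinsic + intrinsic): {ret:.2f}")
```

The key design choice the sketch illustrates is the two-timescale structure: the subcontroller acts every step against the feature-control goal, while the metacontroller re-selects the goal only every META_INTERVAL steps, which is what yields temporally extended exploration.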
