Belousov Boris, Peters Jan
Department of Computer Science, Technische Universität Darmstadt, 64289 Darmstadt, Germany.
Max Planck Institute for Intelligent Systems, 72076 Tübingen, Germany.
Entropy (Basel). 2019 Jul 10;21(7):674. doi: 10.3390/e21070674.
An optimal feedback controller for a given Markov decision process (MDP) can in principle be synthesized by value or policy iteration. However, if the system dynamics and the reward function are unknown, a learning agent must discover an optimal controller via direct interaction with the environment. Such interactive data gathering commonly leads to divergence towards dangerous or uninformative regions of the state space unless additional regularization measures are taken. Prior works proposed bounding the information loss, measured by the Kullback-Leibler (KL) divergence, at every policy improvement step to eliminate instability in the learning dynamics. In this paper, we consider the broader family of f-divergences, and more concretely α-divergences, which inherit the beneficial property of providing the policy improvement step in closed form while at the same time yielding a corresponding dual objective for policy evaluation. This entropic proximal policy optimization view gives a unified perspective on compatible actor-critic architectures. In particular, common least-squares value function estimation coupled with advantage-weighted maximum likelihood policy improvement is shown to correspond to the Pearson χ²-divergence penalty. Other actor-critic pairs arise for various choices of the penalty-generating function f. On a concrete instantiation of our framework with the α-divergence, we carry out asymptotic analysis of the solutions for different values of α and demonstrate the effects of the divergence-function choice on standard reinforcement learning problems.
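For intuition on the closed-form improvement step the abstract mentions, consider the well-known KL-penalized special case: maximizing the expected advantage $\mathbb{E}_\pi[A(s,a)]$ under a KL penalty of strength $\eta$ against the current policy yields the exponentiated-advantage update familiar from relative entropy policy search (the symbols $A$ and $\eta$ are our own notation, chosen for illustration):

$$\pi_{\text{new}}(a \mid s) \;\propto\; \pi_{\text{old}}(a \mid s)\,\exp\!\left(\frac{A(s,a)}{\eta}\right).$$

Swapping the KL penalty for the Pearson χ² divergence instead gives an update that is linear rather than exponential in the centered advantage,

$$\pi_{\text{new}}(a \mid s) \;=\; \pi_{\text{old}}(a \mid s)\left(1 + \frac{A(s,a) - \mathbb{E}_{\pi_{\text{old}}}[A(s,\cdot)]}{2\eta}\right),$$

valid (up to the normalization convention of the χ² divergence) as long as the right-hand side stays non-negative; this is the advantage-weighted form that the abstract connects to least-squares value estimation.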
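The following is a minimal, runnable sketch of both updates for a discrete action distribution in a single state; the function names and the temperature parameter `eta` are hypothetical illustrations, not code from the paper:

```python
import numpy as np

def kl_improvement(pi_old, adv, eta):
    """KL-penalized step: pi_new ∝ pi_old * exp(A / eta).
    Computed in log space with a max shift for numerical stability."""
    logits = np.log(pi_old) + adv / eta
    w = np.exp(logits - logits.max())
    return w / w.sum()

def chi2_improvement(pi_old, adv, eta):
    """Pearson chi^2-penalized step: pi_new = pi_old * (1 + (A - E[A]) / (2*eta)).
    Clip at zero and renormalize in case the penalty is too weak to keep
    the closed-form solution non-negative."""
    centered = adv - pi_old @ adv
    w = pi_old * np.maximum(1.0 + centered / (2.0 * eta), 0.0)
    return w / w.sum()

# Toy example: three actions in a single state.
pi_old = np.array([0.5, 0.3, 0.2])
adv = np.array([1.0, 0.0, -1.0])
print(kl_improvement(pi_old, adv, eta=1.0))    # multiplicative, softmax-like reweighting
print(chi2_improvement(pi_old, adv, eta=1.0))  # additive, linear-in-advantage reweighting
```

In both cases a large η recovers the old policy, while a small η concentrates mass on high-advantage actions, which loosely mirrors the asymptotic behavior in α that the abstract analyzes.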