Optimistic reinforcement learning by forward Kullback-Leibler divergence optimization.

Affiliations

Nara Institute of Science and Technology, Nara, Japan.

Publication Information

Neural Netw. 2022 Aug;152:169-180. doi: 10.1016/j.neunet.2022.04.021. Epub 2022 Apr 21.

Abstract

This paper presents a new interpretation of the traditional optimization method in reinforcement learning (RL) as an optimization problem using reverse Kullback-Leibler (KL) divergence, and derives a new optimization method that uses forward KL divergence in place of reverse KL divergence. Although RL originally aims to maximize return indirectly through optimization of the policy, recent work by Levine proposed a different derivation process that explicitly treats optimality as a stochastic variable. This paper follows that concept and formulates the traditional learning rules for both the value function and the policy as optimization problems with reverse KL divergence that include optimality. By exploiting the asymmetry of KL divergence, new optimization problems with forward KL divergence are derived. Remarkably, these new optimization problems can be regarded as optimistic RL. The optimism is intuitively specified by a hyperparameter converted from an uncertainty parameter. In addition, it can be enhanced when integrated with prioritized experience replay and eligibility traces, both of which accelerate learning. The effects of this expected optimism were investigated through learning tendencies in numerical simulations using PyBullet. As a result, moderate optimism accelerated learning and yielded higher rewards. In a realistic robotic simulation, the proposed method with moderate optimism outperformed one of the state-of-the-art RL methods.
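For readers unfamiliar with the distinction, the asymmetry the abstract refers to can be sketched as below. The notation is assumed here for illustration only and is not copied from the paper: \pi denotes the policy over actions a given state s, and p(a|s, O=1) denotes the action distribution conditioned on the optimality variable O in Levine's control-as-inference framing.

D_{KL}(\pi \,\|\, p) = \mathbb{E}_{a \sim \pi(\cdot|s)} \big[ \log \pi(a|s) - \log p(a|s, O=1) \big]    (reverse KL)
D_{KL}(p \,\|\, \pi) = \mathbb{E}_{a \sim p(\cdot|s, O=1)} \big[ \log p(a|s, O=1) - \log \pi(a|s) \big]    (forward KL)

Because the reverse form takes its expectation under \pi, minimizing it tends to concentrate the policy on a single high-optimality mode (mode-seeking), whereas the forward form takes its expectation under the optimality-conditioned distribution and therefore pushes \pi to cover every region that distribution rates as promising (mass-covering). This standard property of KL divergence is consistent with the optimistic behavior described above; the paper's exact learning rules and the uncertainty-derived hyperparameter are given in the full text.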

