Optimistic reinforcement learning by forward Kullback-Leibler divergence optimization.

Author information

Nara Institute of Science and Technology, Nara, Japan.

Publication information

Neural Netw. 2022 Aug;152:169-180. doi: 10.1016/j.neunet.2022.04.021. Epub 2022 Apr 21.

DOI: 10.1016/j.neunet.2022.04.021
PMID: 35533503
Abstract

This paper presents a new interpretation of the traditional optimization method in reinforcement learning (RL) as an optimization problem using reverse Kullback-Leibler (KL) divergence, and derives a new optimization method using forward KL divergence instead of reverse KL divergence. Although RL originally aims to maximize return indirectly through optimization of the policy, recent work by Levine proposed a different derivation process that explicitly treats optimality as a stochastic variable. This paper follows that concept and formulates the traditional learning rules for both the value function and the policy as optimization problems with reverse KL divergence including optimality. Focusing on the asymmetry of KL divergence, new optimization problems with forward KL divergence are derived. Remarkably, these new optimization problems can be regarded as optimistic RL. The degree of optimism is intuitively specified by a hyperparameter converted from an uncertainty parameter. In addition, it can be enhanced when integrated with prioritized experience replay and eligibility traces, both of which accelerate learning. The effects of this expected optimism were investigated through learning tendencies in numerical simulations using PyBullet. As a result, moderate optimism accelerated learning and yielded higher rewards. In a realistic robotic simulation, the proposed method with moderate optimism outperformed one of the state-of-the-art RL methods.
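The asymmetry of KL divergence that the abstract builds on can be illustrated with a minimal numerical sketch. This is not the paper's method; the two discrete distributions below are made up purely to show that D_KL(model ∥ target) (reverse, mode-seeking) and D_KL(target ∥ model) (forward, mass-covering) differ when the model covers only one mode of a bimodal target:

```python
import numpy as np

def kl(p, q):
    """Discrete KL divergence D_KL(p || q) = sum_i p_i * log(p_i / q_i)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # terms with p_i = 0 contribute 0 by convention
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# A bimodal "target" and a unimodal "model" concentrated on one mode.
target = np.array([0.48, 0.02, 0.02, 0.48])
model = np.array([0.94, 0.02, 0.02, 0.02])

reverse_kl = kl(model, target)  # reverse direction: mode-seeking
forward_kl = kl(target, model)  # forward direction: mass-covering

print(f"reverse KL = {reverse_kl:.3f}")
print(f"forward KL = {forward_kl:.3f}")
```

Here the forward direction penalizes the model heavily for assigning little mass to the target's second mode, while the reverse direction tolerates it, which is the asymmetry the paper exploits to reinterpret RL optimization.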


Similar articles

1
Optimistic reinforcement learning by forward Kullback-Leibler divergence optimization.
Neural Netw. 2022 Aug;152:169-180. doi: 10.1016/j.neunet.2022.04.021. Epub 2022 Apr 21.
2
Forward and inverse reinforcement learning sharing network weights and hyperparameters.
Neural Netw. 2021 Dec;144:138-153. doi: 10.1016/j.neunet.2021.08.017. Epub 2021 Aug 20.
3
Computation of Kullback-Leibler Divergence in Bayesian Networks.
Entropy (Basel). 2021 Aug 28;23(9):1122. doi: 10.3390/e23091122.
4
A robotic model of hippocampal reverse replay for reinforcement learning.
Bioinspir Biomim. 2022 Dec 2;18(1). doi: 10.1088/1748-3190/ac9ffc.
5
Sequential safe feature elimination rule for L-regularized regression with Kullback-Leibler divergence.
Neural Netw. 2022 Nov;155:523-535. doi: 10.1016/j.neunet.2022.09.008. Epub 2022 Sep 13.
6
Nonlocal total variation based on symmetric Kullback-Leibler divergence for the ultrasound image despeckling.
BMC Med Imaging. 2017 Nov 28;17(1):57. doi: 10.1186/s12880-017-0231-7.
7
Estimating the spectrum in computed tomography via Kullback-Leibler divergence constrained optimization.
Med Phys. 2019 Jan;46(1):81-92. doi: 10.1002/mp.13257. Epub 2018 Dec 13.
8
A Satellite Incipient Fault Detection Method Based on Decomposed Kullback-Leibler Divergence.
Entropy (Basel). 2021 Sep 9;23(9):1194. doi: 10.3390/e23091194.
9
Entropic Regularization of Markov Decision Processes.
Entropy (Basel). 2019 Jul 10;21(7):674. doi: 10.3390/e21070674.
10
Integration of Reinforcement Learning in a Virtual Robotic Surgical Simulation.
Surg Innov. 2023 Feb;30(1):94-102. doi: 10.1177/15533506221095298. Epub 2022 May 3.