

Authentic Boundary Proximal Policy Optimization.

Publication Information

IEEE Trans Cybern. 2022 Sep;52(9):9428-9438. doi: 10.1109/TCYB.2021.3051456. Epub 2022 Aug 18.

DOI: 10.1109/TCYB.2021.3051456
PMID: 33705327
Abstract

In recent years, the proximal policy optimization (PPO) algorithm has received considerable attention because of its excellent performance on many challenging tasks. However, the mechanism of PPO's horizontal clipping operation, a key means of improving PPO's performance, still leaves considerable room for theoretical explanation. In addition, although PPO is inspired by the learning theory of trust region policy optimization (TRPO), the theoretical connection between PPO's clipping operation and TRPO's trust region constraint has not been well studied. In this article, we first analyze the effect of PPO's clipping operation on the objective function of conservative policy iteration and rigorously establish the theoretical relationship between PPO and TRPO. We then propose a novel first-order policy gradient algorithm, authentic boundary PPO (ABPPO), based on an authentic boundary setting rule. To better keep the difference between the new and old policies within the clipping range, we build on the idea of ABPPO and propose two improved PPO algorithms: rollback mechanism-based ABPPO (RMABPPO) and penalized point policy difference-based ABPPO (P3DABPPO), which are based on rollback clipping and a penalized point policy difference, respectively. Experiments on continuous robotic control tasks in MuJoCo show that the proposed algorithms effectively improve learning stability and accelerate learning compared with the original PPO.
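To make the clipping mechanism concrete, the sketch below contrasts the standard PPO clipped surrogate, whose gradient vanishes once the probability ratio leaves [1 - eps, 1 + eps], with a rollback-shaped surrogate in the spirit of the rollback clipping behind RMABPPO: outside the range the objective slopes downward in the ratio, so the gradient actively pushes the new policy back toward the old one instead of merely going flat. This is a minimal NumPy illustration, not the paper's implementation; the rollback coefficient alpha and the exact piecewise form are assumptions made for the example.

import numpy as np

def ppo_clip_surrogate(ratio, advantage, eps=0.2):
    # Standard PPO clipped surrogate: outside [1 - eps, 1 + eps] the
    # pessimistic minimum makes the objective flat, so its gradient
    # with respect to the ratio is zero there.
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.minimum(ratio * advantage, clipped)

def rollback_surrogate(ratio, advantage, eps=0.2, alpha=0.3):
    # Rollback-shaped surrogate (illustrative, not the paper's exact form):
    # outside the clipping range the objective slopes downward in the
    # ratio, with each piece shifted so it joins the inner branch
    # continuously at the boundary.
    lower, upper = 1.0 - eps, 1.0 + eps
    rb = np.where(
        ratio > upper,
        -alpha * ratio * advantage + (1.0 + alpha) * upper * advantage,
        np.where(
            ratio < lower,
            -alpha * ratio * advantage + (1.0 + alpha) * lower * advantage,
            ratio * advantage,
        ),
    )
    # Take the pessimistic minimum, as PPO does, so the rollback branch
    # only binds in the region where the plain clip had zero gradient.
    return np.minimum(ratio * advantage, rb)

# Toy check: with a positive advantage and ratios above 1 + eps, the PPO
# surrogate saturates while the rollback surrogate keeps decreasing.
ratios = np.array([1.0, 1.2, 1.5, 2.0])
adv = np.ones_like(ratios)
print(ppo_clip_surrogate(ratios, adv))  # [1.0, 1.2, 1.2, 1.2]
print(rollback_surrogate(ratios, adv))  # approx [1.0, 1.2, 1.11, 0.96]

In an actual training loop this per-sample surrogate would be negated and fed to an autodiff framework as the policy loss; the toy check shows why the rollback shape discourages the ratio from drifting far outside the clipping range.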


Similar Articles

1. Authentic Boundary Proximal Policy Optimization.
IEEE Trans Cybern. 2022 Sep;52(9):9428-9438. doi: 10.1109/TCYB.2021.3051456. Epub 2022 Aug 18.
2. Quantum architecture search via truly proximal policy optimization.
Sci Rep. 2023 Mar 29;13(1):5157. doi: 10.1038/s41598-023-32349-2.
3. An Off-Policy Trust Region Policy Optimization Method With Monotonic Improvement Guarantee for Deep Reinforcement Learning.
IEEE Trans Neural Netw Learn Syst. 2022 May;33(5):2223-2235. doi: 10.1109/TNNLS.2020.3044196. Epub 2022 May 2.
4. Graph-Attention-Based Causal Discovery With Trust Region-Navigated Clipping Policy Optimization.
IEEE Trans Cybern. 2023 Apr;53(4):2311-2324. doi: 10.1109/TCYB.2021.3116762. Epub 2023 Mar 16.
5. New Insights into the Inhibition of Hesperetin on Polyphenol Oxidase: Inhibitory Kinetics, Binding Characteristics, Conformational Change and Computational Simulation.
Foods. 2023 Feb 20;12(4):905. doi: 10.3390/foods12040905.
6. An Improved Distributed Sampling PPO Algorithm Based on Beta Policy for Continuous Global Path Planning Scheme.
Sensors (Basel). 2023 Jul 2;23(13):6101. doi: 10.3390/s23136101.
7. Differential cost analysis: judging a PPO's feasibility.
Healthc Financ Manage. 1986 May;40(5):44-51.
8. Relative Entropy of Correct Proximal Policy Optimization Algorithms with Modified Penalty Factor in Complex Environment.
Entropy (Basel). 2022 Mar 22;24(4):440. doi: 10.3390/e24040440.
9. Combating silent PPOs.
Healthc Financ Manage. 1998 Feb;52(2):44-5.
10. An off-policy multi-agent stochastic policy gradient algorithm for cooperative continuous control.
Neural Netw. 2024 Feb;170:610-621. doi: 10.1016/j.neunet.2023.11.046. Epub 2023 Nov 23.

Cited By

1. Generalized Policy Improvement Algorithms with Theoretically Supported Sample Reuse.
IEEE Trans Automat Contr. 2025 Feb;70(2):1236-1243. doi: 10.1109/tac.2024.3454011. Epub 2024 Sep 3.
2. A collaborative inference strategy for medical image diagnosis in mobile edge computing environment.
PeerJ Comput Sci. 2025 Mar 5;11:e2708. doi: 10.7717/peerj-cs.2708. eCollection 2025.