Efficient Offline Reinforcement Learning With Relaxed Conservatism.

Authors

Huang Longyang, Dong Botao, Zhang Weidong

Publication

IEEE Trans Pattern Anal Mach Intell. 2024 Aug;46(8):5260-5272. doi: 10.1109/TPAMI.2024.3364844. Epub 2024 Jul 2.

DOI: 10.1109/TPAMI.2024.3364844
PMID: 38345962
Abstract

Offline reinforcement learning (RL) aims at learning an optimal policy from a static offline data set, without interacting with the environment. However, the theoretical understanding of the existing offline RL methods needs further studies, among which the conservatism of the learned Q-function and the learned policy is a major issue. In this article, we propose a simple and efficient offline RL with relaxed conservatism (ORL-RC) framework for addressing this concern by learning a Q-function that is close to the true Q-function under the learned policy. The conservatism of learned Q-functions and policies of offline RL methods is analyzed. The analysis results support that the conservatism can lead to policy performance degradation. We establish the convergence results of the proposed ORL-RC, and the bounds of learned Q-functions with and without sampling errors, respectively, suggesting that the gap between the learned Q-function and the true Q-function can be reduced by executing the conservative policy improvement. A practical implementation of ORL-RC is presented and the experimental results on the D4RL benchmark suggest that ORL-RC exhibits superior performance and substantially outperforms existing state-of-the-art offline RL methods.
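To make the notion of conservatism concrete, here is a minimal tabular sketch of a generic penalized offline Q-update. This is an illustration, not the paper's ORL-RC algorithm: the penalty coefficient `alpha`, the dataset format, and the crude "penalize unseen actions" rule are all assumptions for exposition. The point it demonstrates is the trade-off the abstract discusses: a large `alpha` yields a pessimistic (conservative) Q-function that underestimates the true values, while relaxing `alpha` toward zero reduces that gap.

```python
import numpy as np

def penalized_q_update(Q, dataset, alpha=0.5, gamma=0.99, lr=0.1):
    """One sweep of penalized Q-learning over a static offline dataset.

    Q:       (n_states, n_actions) array, updated in place.
    dataset: list of (s, a, r, s_next) transitions with integer indices.
    alpha:   conservatism coefficient; alpha=0 recovers the plain
             Bellman backup, larger alpha gives a more pessimistic Q.
    """
    # Record which (state, action) pairs the behavior policy visited.
    visited = np.zeros(Q.shape, dtype=bool)
    for s, a, _, _ in dataset:
        visited[s, a] = True

    # Standard Bellman backup on in-dataset transitions.
    for s, a, r, s_next in dataset:
        target = r + gamma * Q[s_next].max()
        Q[s, a] += lr * (target - Q[s, a])

    # Conservatism: push down Q-values of actions never seen in the
    # data. Relaxing alpha shrinks the induced underestimation gap.
    Q[~visited] -= alpha
    return Q
```

With `alpha` set high, the learned Q-function systematically undervalues out-of-dataset actions, which is exactly the kind of conservatism the abstract argues can degrade policy performance; a relaxed setting narrows the distance to the true Q-function under the learned policy.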


Similar Articles

1. Efficient Offline Reinforcement Learning With Relaxed Conservatism.
   IEEE Trans Pattern Anal Mach Intell. 2024 Aug;46(8):5260-5272. doi: 10.1109/TPAMI.2024.3364844. Epub 2024 Jul 2.
2. Mild Policy Evaluation for Offline Actor-Critic.
   IEEE Trans Neural Netw Learn Syst. 2024 Dec;35(12):17950-17964. doi: 10.1109/TNNLS.2023.3309906. Epub 2024 Dec 2.
3. Offline Reinforcement Learning With Behavior Value Regularization.
   IEEE Trans Cybern. 2024 Jun;54(6):3692-3704. doi: 10.1109/TCYB.2024.3385910. Epub 2024 May 30.
4. Adaptive pessimism via target Q-value for offline reinforcement learning.
   Neural Netw. 2024 Dec;180:106588. doi: 10.1016/j.neunet.2024.106588. Epub 2024 Aug 5.
5. De-Pessimism Offline Reinforcement Learning via Value Compensation.
   IEEE Trans Neural Netw Learn Syst. 2024 Aug 23;PP. doi: 10.1109/TNNLS.2024.3443082.
6. Monotonic Quantile Network for Worst-Case Offline Reinforcement Learning.
   IEEE Trans Neural Netw Learn Syst. 2024 Jul;35(7):8954-8968. doi: 10.1109/TNNLS.2022.3217189. Epub 2024 Jul 8.
7. Improving Offline Reinforcement Learning With In-Sample Advantage Regularization for Robot Manipulation.
   IEEE Trans Neural Netw Learn Syst. 2024 Sep 20;PP. doi: 10.1109/TNNLS.2024.3443102.
8. False Correlation Reduction for Offline Reinforcement Learning.
   IEEE Trans Pattern Anal Mach Intell. 2024 Feb;46(2):1199-1211. doi: 10.1109/TPAMI.2023.3328397. Epub 2024 Jan 8.
9. Model Selection for Offline Reinforcement Learning: Practical Considerations for Healthcare Settings.
   Proc Mach Learn Res. 2021 Aug;149:2-35.
10. Hundreds Guide Millions: Adaptive Offline Reinforcement Learning With Expert Guidance.
   IEEE Trans Neural Netw Learn Syst. 2024 Nov;35(11):16288-16300. doi: 10.1109/TNNLS.2023.3293508. Epub 2024 Oct 29.