• 文献检索
  • 文档翻译
  • 深度研究
  • 学术资讯
  • Suppr Zotero 插件Zotero 插件
  • 邀请有礼
  • 套餐&价格
  • 历史记录
应用&插件
Suppr Zotero 插件Zotero 插件浏览器插件Mac 客户端Windows 客户端微信小程序
定价
高级版会员购买积分包购买API积分包
服务
文献检索文档翻译深度研究API 文档MCP 服务
关于我们
关于 Suppr公司介绍联系我们用户协议隐私条款
关注我们

Suppr 超能文献

核心技术专利:CN118964589B侵权必究
粤ICP备2023148730 号-1Suppr @ 2026

文献检索

告别复杂PubMed语法,用中文像聊天一样搜索,搜遍4000万医学文献。AI智能推荐,让科研检索更轻松。

立即免费搜索

文件翻译

保留排版,准确专业,支持PDF/Word/PPT等文件格式,支持 12+语言互译。

免费翻译文档

深度研究

AI帮你快速写综述,25分钟生成高质量综述,智能提取关键信息,辅助科研写作。

立即免费体验

通过价值补偿实现的离线强化学习去悲观化

De-Pessimism Offline Reinforcement Learning via Value Compensation.

作者信息

Huang Zhenbo, Zhao Jing, Sun Shiliang

出版信息

IEEE Trans Neural Netw Learn Syst. 2024 Aug 23;PP. doi: 10.1109/TNNLS.2024.3443082.

DOI:10.1109/TNNLS.2024.3443082
PMID:39178073
Abstract

Offline reinforcement learning (RL) has been widely used in practice due to its efficient data utilization, but it still faces the challenge of training vulnerability caused by policy deviation. Existing offline RL methods that add policy constraints or perform conservative Q -value estimation are pessimistic, making the learned policy suboptimal. In this article, we address the pessimism problem by focusing on accurate Q -value estimation. We propose the de-pessimism (DEP) operator to estimate Q values using the optimal Bellman operator or the compensation operator according to whether the actions are in the behavior support set. The compensation operator qualitatively determines the positive or negative nature of out-of-distribution (OOD) actions based on their performance compared with the behavior actions. It leverages differences in state values to compensate for the Q value of positive OOD actions, thereby alleviating pessimism. We theoretically demonstrate the convergence of DEP and its effectiveness in policy improvement. To further advance the practical application, we integrate DEP into the soft actor-critic (SAC) algorithm, yielding the value-compensated de-pessimism offline RL (DoRL-VC). Experimentally, DoRL-VC achieves state-of-the-art (SOTA) performance across mujoco locomotion, Maze 2-D, and challenging Adroit tasks, illustrating the efficacy of DEP in mitigating pessimism.

摘要

离线强化学习(RL)因其高效的数据利用率而在实践中得到广泛应用,但它仍面临着因策略偏差导致的训练脆弱性挑战。现有的添加策略约束或执行保守Q值估计的离线RL方法较为悲观,使得学习到的策略次优。在本文中,我们通过关注准确的Q值估计来解决悲观问题。我们提出了去悲观(DEP)算子,根据动作是否在行为支持集中,使用最优贝尔曼算子或补偿算子来估计Q值。补偿算子根据分布外(OOD)动作与行为动作的性能比较,定性地确定其正负性质。它利用状态值的差异来补偿正OOD动作的Q值,从而减轻悲观情绪。我们从理论上证明了DEP的收敛性及其在策略改进中的有效性。为了进一步推动实际应用,我们将DEP集成到软演员-评论家(SAC)算法中,得到了价值补偿去悲观离线RL(DoRL-VC)。实验表明,DoRL-VC在mujoco运动、二维迷宫和具有挑战性的灵巧任务中均取得了最优(SOTA)性能,说明了DEP在减轻悲观情绪方面的有效性。

相似文献

1
De-Pessimism Offline Reinforcement Learning via Value Compensation.通过价值补偿实现的离线强化学习去悲观化
IEEE Trans Neural Netw Learn Syst. 2024 Aug 23;PP. doi: 10.1109/TNNLS.2024.3443082.
2
Decoupled Prioritized Resampling for Offline RL.用于离线强化学习的解耦优先重采样
IEEE Trans Neural Netw Learn Syst. 2025 Jul;36(7):13094-13108. doi: 10.1109/TNNLS.2024.3488358.
3
Health professionals' experience of teamwork education in acute hospital settings: a systematic review of qualitative literature.医疗专业人员在急症医院环境中团队合作教育的经验:对定性文献的系统综述
JBI Database System Rev Implement Rep. 2016 Apr;14(4):96-137. doi: 10.11124/JBISRIR-2016-1843.
4
A rapid and systematic review of the clinical effectiveness and cost-effectiveness of paclitaxel, docetaxel, gemcitabine and vinorelbine in non-small-cell lung cancer.对紫杉醇、多西他赛、吉西他滨和长春瑞滨在非小细胞肺癌中的临床疗效和成本效益进行的快速系统评价。
Health Technol Assess. 2001;5(32):1-195. doi: 10.3310/hta5320.
5
Management of urinary stones by experts in stone disease (ESD 2025).结石病专家对尿路结石的管理(2025年结石病专家共识)
Arch Ital Urol Androl. 2025 Jun 30;97(2):14085. doi: 10.4081/aiua.2025.14085.
6
Ambulatory Oxygen for Pulmonary Fibrosis (OxyPuF): a randomised controlled trial and acceptability study.用于肺纤维化的门诊氧疗(OxyPuF):一项随机对照试验和可接受性研究。
Health Technol Assess. 2025 Jul 2:1-33. doi: 10.3310/TWKS4194.
7
Adaptive pessimism via target Q-value for offline reinforcement learning.基于目标 Q 值的离线强化学习自适应悲观主义。
Neural Netw. 2024 Dec;180:106588. doi: 10.1016/j.neunet.2024.106588. Epub 2024 Aug 5.
8
Factors that impact on the use of mechanical ventilation weaning protocols in critically ill adults and children: a qualitative evidence-synthesis.影响重症成人和儿童机械通气撤机方案使用的因素:一项定性证据综合分析
Cochrane Database Syst Rev. 2016 Oct 4;10(10):CD011812. doi: 10.1002/14651858.CD011812.pub2.
9
Cost-effectiveness of using prognostic information to select women with breast cancer for adjuvant systemic therapy.利用预后信息为乳腺癌患者选择辅助性全身治疗的成本效益
Health Technol Assess. 2006 Sep;10(34):iii-iv, ix-xi, 1-204. doi: 10.3310/hta10340.
10
"In a State of Flow": A Qualitative Examination of Autistic Adults' Phenomenological Experiences of Task Immersion.“心流状态”:对自闭症成年人任务沉浸现象学体验的质性研究
Autism Adulthood. 2024 Sep 16;6(3):362-373. doi: 10.1089/aut.2023.0032. eCollection 2024 Sep.