

Sample Efficient Deep Reinforcement Learning With Online State Abstraction and Causal Transformer Model Prediction.

Authors

Lan Yixing, Xu Xin, Fang Qiang, Hao Jianye

Publication

IEEE Trans Neural Netw Learn Syst. 2024 Nov;35(11):16574-16588. doi: 10.1109/TNNLS.2023.3296642. Epub 2024 Oct 29.

DOI: 10.1109/TNNLS.2023.3296642
PMID: 37581972
Abstract

Deep reinforcement learning (RL) typically requires a tremendous number of training samples, which are not practical in many applications. State abstraction and world models are two promising approaches for improving sample efficiency in deep RL. However, both state abstraction and world models may degrade the learning performance. In this article, we propose an abstracted model-based policy learning (AMPL) algorithm, which improves the sample efficiency of deep RL. In AMPL, a novel state abstraction method via multistep bisimulation is first developed to learn task-related latent state spaces. Hence, the original Markov decision processes (MDPs) are compressed into abstracted MDPs. Then, a causal transformer model predictor (CTMP) is designed to approximate the abstracted MDPs and generate long-horizon simulated trajectories with a smaller multistep prediction error. Policies are efficiently learned through these trajectories within the abstracted MDPs via a modified multistep soft actor-critic algorithm with a λ -target. Moreover, theoretical analysis shows that the AMPL algorithm can improve sample efficiency during the training process. On Atari games and the DeepMind Control (DMControl) suite, AMPL surpasses current state-of-the-art deep RL algorithms in terms of sample efficiency. Furthermore, DMControl tasks with moving noises are conducted, and the results demonstrate that AMPL is robust to task-irrelevant observational distractors and significantly outperforms the existing approaches.
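The λ-target used in the modified multistep soft actor-critic step is, per the abstract, a standard TD(λ)-style blend of bootstrapped value estimates over the simulated trajectories. The sketch below is an illustrative reconstruction of that generic computation, not the authors' implementation; the function name and argument layout are assumptions for exposition.

```python
def lambda_returns(rewards, boot_values, gamma=0.99, lam=0.95):
    """Compute TD(lambda)-style targets over one simulated trajectory.

    rewards[t]     -- reward received at step t (length H)
    boot_values[t] -- critic estimate V(s_{t+1}) used to bootstrap step t
    """
    H = len(rewards)
    targets = [0.0] * H
    # The final step bootstraps purely from the critic's value estimate.
    last = boot_values[-1]
    for t in reversed(range(H)):
        # Blend the one-step bootstrap with the longer-horizon return:
        # G_t = r_t + gamma * ((1 - lam) * V(s_{t+1}) + lam * G_{t+1})
        last = rewards[t] + gamma * ((1 - lam) * boot_values[t] + lam * last)
        targets[t] = last
    return targets
```

Setting `lam=0` recovers one-step TD targets, while `lam=1` gives full Monte-Carlo-style multistep returns with a terminal bootstrap; intermediate values trade variance against the model's multistep prediction error, which is why a predictor with smaller multistep error (like CTMP) pairs naturally with longer-horizon targets.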


Similar Articles

1. Sample Efficient Deep Reinforcement Learning With Online State Abstraction and Causal Transformer Model Prediction.
IEEE Trans Neural Netw Learn Syst. 2024 Nov;35(11):16574-16588. doi: 10.1109/TNNLS.2023.3296642. Epub 2024 Oct 29.
2. State Abstraction via Deep Supervised Hash Learning.
IEEE Trans Neural Netw Learn Syst. 2025 Jul;36(7):13608-13614. doi: 10.1109/TNNLS.2024.3467338.
3. An immediate-return reinforcement learning for the atypical Markov decision processes.
Front Neurorobot. 2022 Dec 13;16:1012427. doi: 10.3389/fnbot.2022.1012427. eCollection 2022.
4. Hierarchical approximate policy iteration with binary-tree state space decomposition.
IEEE Trans Neural Netw. 2011 Dec;22(12):1863-77. doi: 10.1109/TNN.2011.2168422. Epub 2011 Oct 10.
5. Stochastic Integrated Actor-Critic for Deep Reinforcement Learning.
IEEE Trans Neural Netw Learn Syst. 2024 May;35(5):6654-6666. doi: 10.1109/TNNLS.2022.3212273. Epub 2024 May 2.
6. Kernel-based least squares policy iteration for reinforcement learning.
IEEE Trans Neural Netw. 2007 Jul;18(4):973-92. doi: 10.1109/TNN.2007.899161.
7. Masked Contrastive Representation Learning for Reinforcement Learning.
IEEE Trans Pattern Anal Mach Intell. 2023 Mar;45(3):3421-3433. doi: 10.1109/TPAMI.2022.3176413. Epub 2023 Feb 3.
8. Continuous action deep reinforcement learning for propofol dosing during general anesthesia.
Artif Intell Med. 2022 Jan;123:102227. doi: 10.1016/j.artmed.2021.102227. Epub 2021 Dec 2.
9. Decentralized multi-agent reinforcement learning based on best-response policies.
Front Robot AI. 2024 Apr 16;11:1229026. doi: 10.3389/frobt.2024.1229026. eCollection 2024.
10. An actor-critic framework based on deep reinforcement learning for addressing flexible job shop scheduling problems.
Math Biosci Eng. 2024 Jan;21(1):1445-1471. doi: 10.3934/mbe.2024062. Epub 2022 Dec 28.