

Sample Efficient Deep Reinforcement Learning With Online State Abstraction and Causal Transformer Model Prediction.

Author Information

Lan Yixing, Xu Xin, Fang Qiang, Hao Jianye

Publication Information

IEEE Trans Neural Netw Learn Syst. 2024 Nov;35(11):16574-16588. doi: 10.1109/TNNLS.2023.3296642. Epub 2024 Oct 29.

Abstract

Deep reinforcement learning (RL) typically requires a tremendous number of training samples, which is impractical in many applications. State abstraction and world models are two promising approaches for improving sample efficiency in deep RL. However, both state abstraction and world models may degrade learning performance. In this article, we propose an abstracted model-based policy learning (AMPL) algorithm, which improves the sample efficiency of deep RL. In AMPL, a novel state abstraction method based on multistep bisimulation is first developed to learn task-related latent state spaces, so that the original Markov decision processes (MDPs) are compressed into abstracted MDPs. Then, a causal transformer model predictor (CTMP) is designed to approximate the abstracted MDPs and generate long-horizon simulated trajectories with a smaller multistep prediction error. Policies are efficiently learned from these trajectories within the abstracted MDPs via a modified multistep soft actor-critic algorithm with a λ-target. Moreover, theoretical analysis shows that the AMPL algorithm can improve sample efficiency during training. On Atari games and the DeepMind Control (DMControl) suite, AMPL surpasses current state-of-the-art deep RL algorithms in terms of sample efficiency. Furthermore, experiments on DMControl tasks with moving noise demonstrate that AMPL is robust to task-irrelevant observational distractors and significantly outperforms existing approaches.
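The abstract does not spell out how the λ-target is constructed. The sketch below shows the standard recursive λ-return commonly used when a critic is trained on model-generated (imagined) trajectories; the function name `lambda_targets`, the tensor shapes, and the bootstrapping convention are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def lambda_targets(rewards, values, discount=0.99, lam=0.95):
    """Compute recursive lambda-return targets over an imagined trajectory.

    rewards: tensor of shape (H,)   -- rewards predicted along the rollout
    values:  tensor of shape (H+1,) -- critic values of the latent states z_0..z_H
    Returns a tensor of shape (H,) of regression targets for the critic
    (and, through the critic, a learning signal for the actor).
    """
    H = rewards.shape[0]
    targets = torch.empty(H)
    # Work backwards from the horizon, bootstrapping with the critic value V(z_H):
    #   G_t = r_t + gamma * ((1 - lam) * V(z_{t+1}) + lam * G_{t+1})
    next_target = values[H]
    for t in reversed(range(H)):
        next_target = rewards[t] + discount * ((1 - lam) * values[t + 1] + lam * next_target)
        targets[t] = next_target
    return targets
```

With `lam = 0` this collapses to one-step temporal-difference targets, and with `lam = 1` to Monte Carlo returns bootstrapped only at the rollout horizon; intermediate values trade off bias against the compounding error of long model rollouts.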

