Episodic Memory-Double Actor-Critic Twin Delayed Deep Deterministic Policy Gradient.

Author Information

Shu Man, Lü Shuai, Gong Xiaoyu, An Daolong, Li Songlin

Affiliations

Key Laboratory of Symbolic Computation and Knowledge Engineering (Jilin University), Ministry of Education, Changchun 130012, China; Changchun Institute of Optics, Fine Mechanics and Physics, Chinese Academy of Sciences, Changchun 130033, China; College of Computer Science and Technology, Jilin University, Changchun 130012, China.

Key Laboratory of Symbolic Computation and Knowledge Engineering (Jilin University), Ministry of Education, Changchun 130012, China; College of Computer Science and Technology, Jilin University, Changchun 130012, China; College of Software, Jilin University, Changchun 130012, China.

Publication Information

Neural Netw. 2025 Jul;187:107286. doi: 10.1016/j.neunet.2025.107286. Epub 2025 Feb 27.

Abstract

Existing deep reinforcement learning (DRL) algorithms suffer from low sample efficiency. Episodic memory allows DRL algorithms to remember and reuse past experiences with high return, thereby improving sample efficiency. However, because the state-action space in continuous action tasks is high-dimensional, previous methods for such tasks typically only exploit the information stored in episodic memory rather than using episodic memory directly for action selection, as is done in discrete action tasks. We suppose that episodic memory retains the potential to guide action selection in continuous control tasks. Our objective is to improve sample efficiency by leveraging episodic memory for action selection in such tasks: either reducing the number of training steps required to reach comparable performance or enabling the agent to obtain higher rewards within the same number of training steps. To this end, we propose an Episodic Memory-Double Actor-Critic (EMDAC) framework, which can use episodic memory for action selection in continuous action tasks. The critics and the episodic memory evaluate the value of the state-action pairs proposed by the two actors to determine the final action. In addition, we design an episodic memory based on a Kalman filter optimizer, which is updated with the episodic rewards of collected state-action pairs; the Kalman filter optimizer assigns different weights to experiences collected in different time periods during the memory update. In our episodic memory, state-action pair clusters serve as indices, recording both the occurrence frequency of these clusters and value estimates for the corresponding state-action pairs. This allows the value of a state-action pair cluster to be estimated by querying the episodic memory. We then design an intrinsic reward based on the novelty of state-action pairs with respect to the episodic memory, defined by the occurrence frequency of state-action pair clusters, to enhance the exploration capability of the agent. Finally, we propose the EMDAC-TD3 algorithm by applying these three modules to the Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm within the Actor-Critic framework. In evaluations on MuJoCo environments from the OpenAI Gym domain, EMDAC-TD3 achieves higher sample efficiency than baseline algorithms. EMDAC-TD3 also demonstrates superior final performance compared to state-of-the-art episodic control algorithms and advanced Actor-Critic algorithms, as measured by final rewards, Median, Interquartile Mean, Mean, and Optimality Gap. The final rewards directly demonstrate the advantages of the algorithms. Based on the final rewards, EMDAC-TD3 achieves an average performance improvement of 11.01% over TD3, surpassing the current state-of-the-art algorithms in the same category.
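The abstract describes the mechanism only at a high level, so the Python sketch below is purely illustrative. The class and function names (EpisodicMemory, select_action), the nearest-centroid clustering of state-action pairs, the running-average value update used in place of the paper's Kalman-filter-optimizer update, the count-based novelty bonus, and the weighting parameter beta are all assumptions made for this example, not the authors' implementation; states and actions are assumed to be 1-D vectors.

```python
# Illustrative sketch only: cluster-indexed episodic memory, a simple novelty bonus,
# and double-actor action selection. The update rules here are placeholders; the
# paper's Kalman-filter-optimizer update and exact selection rule are not given
# in the abstract.
import numpy as np


class EpisodicMemory:
    """Episodic memory indexed by state-action-pair clusters.

    Each cluster stores an occurrence count and a value estimate updated from
    episodic returns (a running average here, standing in for the paper's
    Kalman-filter-optimizer update, which weights experiences collected in
    different time periods differently)."""

    def __init__(self, cluster_centers):
        self.centers = np.asarray(cluster_centers)  # (K, d) centroids over concatenated (state, action)
        self.counts = np.zeros(len(self.centers))   # occurrence frequency per cluster
        self.values = np.zeros(len(self.centers))   # value estimate per cluster

    def _cluster(self, state_action):
        # Nearest-centroid assignment of a concatenated (state, action) vector.
        dists = np.linalg.norm(self.centers - state_action, axis=1)
        return int(np.argmin(dists))

    def update(self, state_action, episodic_return):
        k = self._cluster(state_action)
        self.counts[k] += 1
        # Running-average placeholder for the Kalman-filter-based update.
        self.values[k] += (episodic_return - self.values[k]) / self.counts[k]

    def value(self, state_action):
        return self.values[self._cluster(state_action)]

    def intrinsic_reward(self, state_action, scale=0.1):
        # Novelty bonus: rarely visited clusters receive a larger intrinsic reward.
        k = self._cluster(state_action)
        return scale / np.sqrt(self.counts[k] + 1.0)


def select_action(state, actor1, actor2, critic, memory, beta=0.5):
    """Choose between the two actors' proposals by mixing critic and memory values.

    actor1/actor2 map state -> action, critic maps (state, action) -> Q, and beta
    (an assumed hyperparameter) weights the critic against the memory estimate."""
    candidates = [actor1(state), actor2(state)]
    scores = [
        beta * critic(state, a)
        + (1.0 - beta) * memory.value(np.concatenate([state, a]))
        for a in candidates
    ]
    return candidates[int(np.argmax(scores))]
```

In the full EMDAC-TD3 algorithm, this selection step would presumably sit on top of TD3's twin critics, with the intrinsic reward added to the environment reward during training; since the abstract does not spell out these details, the sketch should be read only as a rough outline of the mechanism.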


Similar Articles

Stochastic Integrated Actor-Critic for Deep Reinforcement Learning.
IEEE Trans Neural Netw Learn Syst. 2024 May;35(5):6654-6666. doi: 10.1109/TNNLS.2022.3212273. Epub 2024 May 2.

Meta attention for Off-Policy Actor-Critic.
Neural Netw. 2023 Jun;163:86-96. doi: 10.1016/j.neunet.2023.03.024. Epub 2023 Mar 28.

Reducing Estimation Bias via Triplet-Average Deep Deterministic Policy Gradient.
IEEE Trans Neural Netw Learn Syst. 2020 Nov;31(11):4933-4945. doi: 10.1109/TNNLS.2019.2959129. Epub 2020 Oct 30.

Human-in-the-Loop Reinforcement Learning in Continuous-Action Space.
IEEE Trans Neural Netw Learn Syst. 2024 Nov;35(11):15735-15744. doi: 10.1109/TNNLS.2023.3289315. Epub 2024 Oct 29.

Diversity Evolutionary Policy Deep Reinforcement Learning.
Comput Intell Neurosci. 2021 Aug 3;2021:5300189. doi: 10.1155/2021/5300189. eCollection 2021.
