Shu Man, Lü Shuai, Gong Xiaoyu, An Daolong, Li Songlin
Key Laboratory of Symbolic Computation and Knowledge Engineering (Jilin University), Ministry of Education, Changchun 130012, China; Changchun Institute of Optics, Fine Mechanics and Physics, Chinese Academy of Sciences, Changchun 130033, China; College of Computer Science and Technology, Jilin University, Changchun 130012, China.
Key Laboratory of Symbolic Computation and Knowledge Engineering (Jilin University), Ministry of Education, Changchun 130012, China; College of Computer Science and Technology, Jilin University, Changchun 130012, China; College of Software, Jilin University, Changchun 130012, China.
Neural Netw. 2025 Jul;187:107286. doi: 10.1016/j.neunet.2025.107286. Epub 2025 Feb 27.
Existing deep reinforcement learning (DRL) algorithms suffer from low sample efficiency. Episodic memory allows DRL algorithms to remember and reuse past experiences with high return, thereby improving sample efficiency. However, due to the high dimensionality of the state-action space in continuous action tasks, previous methods often only use the information stored in episodic memory, rather than employing episodic memory directly for action selection as is done in discrete action tasks. We hypothesize that episodic memory retains the potential to guide action selection in continuous control tasks. Our objective is to enhance sample efficiency by leveraging episodic memory for action selection in such tasks: either reducing the number of training steps required to achieve comparable performance or enabling the agent to obtain higher rewards within the same number of training steps. To this end, we propose an "Episodic Memory-Double Actor-Critic (EMDAC)" framework, which can use episodic memory for action selection in continuous action tasks. The critics and the episodic memory evaluate the value of the state-action pairs proposed by the two actors to determine the final action. Meanwhile, we design an episodic memory based on a Kalman filter optimizer, which is updated using the episodic rewards of collected state-action pairs. The Kalman filter optimizer assigns different weights to experiences collected at different time periods during the memory update process. In our episodic memory, state-action pair clusters are used as indices, recording both the occurrence frequency of these clusters and the value estimates of the corresponding state-action pairs. This enables the value of a state-action pair cluster to be estimated by querying the episodic memory. After that, we design an intrinsic reward based on the novelty of state-action pairs with respect to the episodic memory, defined by the occurrence frequency of state-action pair clusters, to enhance the exploration capability of the agent. Finally, we propose the "EMDAC-TD3" algorithm by applying these three modules to the Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm within an Actor-Critic framework. Through evaluations in MuJoCo environments within the OpenAI Gym domain, EMDAC-TD3 achieves higher sample efficiency than the baseline algorithms. EMDAC-TD3 demonstrates superior final performance compared to state-of-the-art episodic control algorithms and advanced Actor-Critic algorithms, as measured by final rewards, Median, Interquartile Mean, Mean, and Optimality Gap. The final rewards directly demonstrate the advantages of the algorithms. Based on the final rewards, EMDAC-TD3 achieves an average performance improvement of 11.01% over TD3, surpassing the current state-of-the-art algorithms in the same category.
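The abstract describes the mechanism only at a high level. The sketch below illustrates one plausible reading of it in Python: a cluster-indexed episodic memory that tracks occurrence counts and Kalman-style value estimates, a count-based intrinsic reward, and a selection step that combines critic and memory values to choose between the two actors' proposals. The quantization-based "clustering", the scalar Kalman update, the mixing weight `alpha`, and all names (`EpisodicMemory`, `emdac_select_action`) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


class EpisodicMemory:
    """Illustrative cluster-indexed episodic memory (not the paper's code).

    State-action pairs are mapped to discrete cluster keys; each cluster stores an
    occurrence count and a value estimate. Updates blend the old estimate with the
    new episodic return via a Kalman-style gain, so experiences collected at
    different times receive different effective weights.
    """

    def __init__(self, bin_width=0.5, obs_noise_var=1.0):
        self.bin_width = bin_width          # assumed quantization width standing in for clustering
        self.obs_noise_var = obs_noise_var  # assumed observation-noise variance of episodic returns
        self.counts = {}                    # cluster key -> occurrence frequency
        self.values = {}                    # cluster key -> value estimate
        self.variances = {}                 # cluster key -> estimate variance (Kalman state)

    def _key(self, state, action):
        # Hypothetical clustering: quantize the concatenated state-action vector.
        sa = np.concatenate([np.asarray(state, dtype=float), np.asarray(action, dtype=float)])
        return tuple(np.round(sa / self.bin_width).astype(int))

    def update(self, state, action, episodic_return):
        key = self._key(state, action)
        self.counts[key] = self.counts.get(key, 0) + 1
        if key not in self.values:
            self.values[key] = float(episodic_return)
            self.variances[key] = self.obs_noise_var
            return
        # Kalman-filter-style update: the gain shrinks as the estimate variance shrinks,
        # so well-estimated (older) clusters change less than newly observed ones.
        p = self.variances[key]
        gain = p / (p + self.obs_noise_var)
        self.values[key] += gain * (episodic_return - self.values[key])
        self.variances[key] = (1.0 - gain) * p

    def value(self, state, action, default=None):
        return self.values.get(self._key(state, action), default)

    def intrinsic_reward(self, state, action, beta=0.1):
        # Count-based novelty bonus: rarely visited clusters yield larger bonuses.
        n = self.counts.get(self._key(state, action), 0)
        return beta / np.sqrt(n + 1)


def emdac_select_action(state, actors, critic, memory, alpha=0.5):
    """Choose between the two actors' candidate actions.

    `actors` are callables state -> action; `critic` is a callable (state, action) -> scalar.
    Each candidate is scored by its critic value, blended with the episodic-memory value
    when one exists; `alpha` is an assumed mixing weight for illustration.
    """
    candidates = [actor(state) for actor in actors]
    scores = []
    for a in candidates:
        q = critic(state, a)
        mem_v = memory.value(state, a)
        scores.append(q if mem_v is None else alpha * q + (1.0 - alpha) * mem_v)
    return candidates[int(np.argmax(scores))]
```

In this reading, the memory only influences action selection for clusters it has already seen, while the intrinsic reward pushes the agent toward clusters with low occurrence counts; both effects depend on the same cluster index, which is why the sketch keeps counts and value estimates in one structure.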