Deep Reinforcement Learning With Modulated Hebbian Plus Q-Network Architecture.

Author Information

Ladosz Pawel, Ben-Iwhiwhu Eseoghene, Dick Jeffery, Ketz Nicholas, Kolouri Soheil, Krichmar Jeffrey L, Pilly Praveen K, Soltoggio Andrea

Publication Information

IEEE Trans Neural Netw Learn Syst. 2022 May;33(5):2045-2056. doi: 10.1109/TNNLS.2021.3110281. Epub 2022 May 2.

DOI: 10.1109/TNNLS.2021.3110281
PMID: 34559664
Abstract

In this article, we consider a subclass of partially observable Markov decision process (POMDP) problems which we termed confounding POMDPs. In these types of POMDPs, temporal difference (TD)-based reinforcement learning (RL) algorithms struggle, as TD error cannot be easily derived from observations. We solve these types of problems using a new bio-inspired neural architecture that combines a modulated Hebbian network (MOHN) with deep Q-network (DQN), which we call modulated Hebbian plus Q-network architecture (MOHQA). The key idea is to use a Hebbian network with rarely correlated bio-inspired neural traces to bridge temporal delays between actions and rewards when confounding observations and sparse rewards result in inaccurate TD errors. In MOHQA, DQN learns low-level features and control, while the MOHN contributes to high-level decisions by associating rewards with past states and actions. Thus, the proposed architecture combines two modules with significantly different learning algorithms, a Hebbian associative network and a classical DQN pipeline, exploiting the advantages of both. Simulations on a set of POMDPs and on the Malmo environment show that the proposed algorithm improved DQN's results and even outperformed control tests with advantage-actor critic (A2C), quantile regression DQN with long short-term memory (QRDQN + LSTM), Monte Carlo policy gradient (REINFORCE), and aggregated memory for reinforcement learning (AMRL) algorithms on most difficult POMDPs with confounding stimuli and sparse rewards.
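The abstract describes the MOHQA only at a high level. The sketch below is a minimal, illustrative reading of that description in Python/NumPy: a linear stand-in for the DQN branch trained on a one-step TD target, a Hebbian branch whose decaying eligibility traces are modulated by the (possibly sparse and delayed) reward, and an action selector that combines the two value estimates. All class, method, and parameter names here are assumptions for illustration, not the authors' implementation; in particular, the paper's DQN branch is a deep network with the usual replay and target-network machinery, and its rarely correlating trace rule is more specific than the plain exponential decay used below.

```python
import numpy as np

class MOHN:
    """Modulated Hebbian network (illustrative): associates delayed rewards
    with past state-action pairs via decaying eligibility traces."""
    def __init__(self, n_features, n_actions, lr=0.01, trace_decay=0.99):
        self.W = np.zeros((n_actions, n_features))   # associative weights
        self.trace = np.zeros_like(self.W)           # eligibility traces
        self.lr = lr
        self.trace_decay = trace_decay

    def value(self, features):
        # High-level preference for each action given current features.
        return self.W @ features

    def update(self, features, action, reward):
        # Accumulate a Hebbian trace for the taken (state, action) pair,
        # then let the (possibly delayed, sparse) reward modulate learning.
        self.trace *= self.trace_decay
        self.trace[action] += features
        self.W += self.lr * reward * self.trace


class TinyDQN:
    """Stand-in for the DQN branch: a linear Q-function trained with a
    one-step TD target (the real branch is a deep network with replay
    and a target network)."""
    def __init__(self, n_features, n_actions, lr=0.01, gamma=0.99):
        self.W = np.zeros((n_actions, n_features))
        self.lr, self.gamma = lr, gamma

    def q_values(self, features):
        return self.W @ features

    def update(self, f, a, r, f_next, done):
        target = r + (0.0 if done else self.gamma * self.q_values(f_next).max())
        td_error = target - self.q_values(f)[a]
        self.W[a] += self.lr * td_error * f


class MOHQA:
    """Combine the two branches: the DQN handles low-level control while
    the Hebbian branch biases high-level decisions when confounding
    observations make the TD error unreliable."""
    def __init__(self, n_features, n_actions, epsilon=0.1):
        self.dqn = TinyDQN(n_features, n_actions)
        self.mohn = MOHN(n_features, n_actions)
        self.epsilon = epsilon
        self.n_actions = n_actions

    def act(self, features):
        if np.random.rand() < self.epsilon:
            return np.random.randint(self.n_actions)
        combined = self.dqn.q_values(features) + self.mohn.value(features)
        return int(np.argmax(combined))

    def learn(self, f, a, r, f_next, done):
        self.dqn.update(f, a, r, f_next, done)
        self.mohn.update(f, a, r)


# Toy usage with random features (purely illustrative).
agent = MOHQA(n_features=8, n_actions=4)
f = np.random.rand(8)
a = agent.act(f)
f_next = np.random.rand(8)
agent.learn(f, a, r=1.0, f_next=f_next, done=False)
```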

Similar Articles

1. Deep Reinforcement Learning With Modulated Hebbian Plus Q-Network Architecture.
IEEE Trans Neural Netw Learn Syst. 2022 May;33(5):2045-2056. doi: 10.1109/TNNLS.2021.3110281. Epub 2022 May 2.

2. Combining STDP and binary networks for reinforcement learning from images and sparse rewards.
Neural Netw. 2021 Dec;144:496-506. doi: 10.1016/j.neunet.2021.09.010. Epub 2021 Sep 17.

3. Exploration in neo-Hebbian reinforcement learning: Computational approaches to the exploration-exploitation balance with bio-inspired neural networks.
Neural Netw. 2022 Jul;151:16-33. doi: 10.1016/j.neunet.2022.03.021. Epub 2022 Mar 23.

4. Deep reinforcement learning for automated radiation adaptation in lung cancer.
Med Phys. 2017 Dec;44(12):6690-6705. doi: 10.1002/mp.12625. Epub 2017 Nov 14.

5. Recognition of Hand Gestures Based on EMG Signals with Deep and Double-Deep Q-Networks.
Sensors (Basel). 2023 Apr 12;23(8):3905. doi: 10.3390/s23083905.

6. Target Tracking Control of a Biomimetic Underwater Vehicle Through Deep Reinforcement Learning.
IEEE Trans Neural Netw Learn Syst. 2022 Aug;33(8):3741-3752. doi: 10.1109/TNNLS.2021.3054402. Epub 2022 Aug 3.

7. Reinforcement learning using a continuous time actor-critic framework with spiking neurons.
PLoS Comput Biol. 2013 Apr;9(4):e1003024. doi: 10.1371/journal.pcbi.1003024. Epub 2013 Apr 11.

8. Multisource Transfer Double DQN Based on Actor Learning.
IEEE Trans Neural Netw Learn Syst. 2018 Jun;29(6):2227-2238. doi: 10.1109/TNNLS.2018.2806087.

9. Application of Deep Reinforcement Learning to NS-SHAFT Game Signal Control.
Sensors (Basel). 2022 Jul 14;22(14):5265. doi: 10.3390/s22145265.

10. Deep Reinforcement Learning on Autonomous Driving Policy With Auxiliary Critic Network.
IEEE Trans Neural Netw Learn Syst. 2023 Jul;34(7):3680-3690. doi: 10.1109/TNNLS.2021.3116063. Epub 2023 Jul 6.