Ladosz Pawel, Ben-Iwhiwhu Eseoghene, Dick Jeffery, Ketz Nicholas, Kolouri Soheil, Krichmar Jeffrey L, Pilly Praveen K, Soltoggio Andrea
IEEE Trans Neural Netw Learn Syst. 2022 May;33(5):2045-2056. doi: 10.1109/TNNLS.2021.3110281. Epub 2022 May 2.
In this article, we consider a subclass of partially observable Markov decision process (POMDP) problems, which we term confounding POMDPs. In these types of POMDPs, temporal difference (TD)-based reinforcement learning (RL) algorithms struggle, as the TD error cannot be easily derived from observations. We solve these types of problems using a new bio-inspired neural architecture that combines a modulated Hebbian network (MOHN) with a deep Q-network (DQN), which we call the modulated Hebbian plus Q-network architecture (MOHQA). The key idea is to use a Hebbian network with rarely correlated bio-inspired neural traces to bridge temporal delays between actions and rewards when confounding observations and sparse rewards result in inaccurate TD errors. In MOHQA, the DQN learns low-level features and control, while the MOHN contributes to high-level decisions by associating rewards with past states and actions. Thus, the proposed architecture combines two modules with significantly different learning algorithms, a Hebbian associative network and a classical DQN pipeline, exploiting the advantages of both. Simulations on a set of POMDPs and on the Malmo environment show that the proposed algorithm improved DQN's results and even outperformed control tests with advantage actor-critic (A2C), quantile regression DQN with long short-term memory (QRDQN + LSTM), Monte Carlo policy gradient (REINFORCE), and aggregated memory for reinforcement learning (AMRL) algorithms on the most difficult POMDPs with confounding stimuli and sparse rewards.
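A minimal sketch of the kind of mechanism the abstract describes: a reward-modulated Hebbian layer with decaying eligibility traces, so that a sparse, delayed reward can still credit earlier state-action pairings. This is not the authors' implementation; the class name, dimensions, and the learning-rate and trace-decay constants below are illustrative assumptions, and the associative readout would be combined with a DQN's Q-values in the full architecture.

```python
import numpy as np

class ModulatedHebbianLayer:
    """Assumed sketch of a reward-modulated Hebbian associative layer."""

    def __init__(self, n_in, n_out, lr=0.01, trace_decay=0.9):
        self.W = np.zeros((n_out, n_in))      # associative weights
        self.trace = np.zeros_like(self.W)    # eligibility traces over pre/post co-activity
        self.lr = lr
        self.trace_decay = trace_decay

    def forward(self, x):
        # Linear associative readout; downstream code could add this to DQN Q-values.
        return self.W @ x

    def update(self, x, y, reward):
        # Accumulate a decaying Hebbian trace of co-active pre- and post-synaptic
        # activity, then let the (possibly delayed, sparse) reward modulate the
        # actual weight change.
        self.trace = self.trace_decay * self.trace + np.outer(y, x)
        self.W += self.lr * reward * self.trace


# Toy usage: a reward arriving only at the final step still reinforces
# earlier observation-action pairings through the decaying trace.
rng = np.random.default_rng(0)
layer = ModulatedHebbianLayer(n_in=8, n_out=4)
for step in range(20):
    obs = rng.random(8)
    act = np.eye(4)[rng.integers(4)]          # one-hot "post-synaptic" action activity
    r = 1.0 if step == 19 else 0.0            # sparse reward only at the end
    layer.update(obs, act, r)
```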