Biological implementation of the temporal difference algorithm for reinforcement learning: theoretical comment on O'Reilly et al. (2007).

Author information

Houk James C

Affiliation

Department of Physiology, Institute for Neuroscience, Feinberg School of Medicine, Northwestern University, Chicago, IL 60611, USA.

Publication information

Behav Neurosci. 2007 Feb;121(1):231-2. doi: 10.1037/0735-7044.121.1.231.

PMID:17324068

Abstract

The ability to survive in the world depends critically on the brain's capacity to detect earlier and earlier predictors of reward or punishment. The dominant theoretical perspective for understanding this capacity has been the temporal difference (TD) algorithm for reinforcement learning. In this issue of Behavioral Neuroscience, R. C. O'Reilly, M. J. Frank, T. E. Hazy, and B. Watz (2007) propose a new model, dubbed primary value and learned value (PVLV), that is simpler than TD and that they claim is more biologically realistic. In this commentary, the author suggests slight modifications to a previous biological implementation of TD instead of adopting the new PVLV algorithm.
