Graduate School of Information Science and Technology, The University of Tokyo, 7-3-1 Hongo, Bunkyo, Tokyo, 113-8656, Japan.
Bull Math Biol. 2009 Nov;71(8):1818-50. doi: 10.1007/s11538-009-9424-8. Epub 2009 May 29.
Direct reciprocity is a chief mechanism of mutual cooperation in social dilemmas. Agents cooperate if future interactions with the same opponents are highly likely. Direct reciprocity has been explored mostly by evolutionary game theory based on natural selection. Our daily experience tells us, however, that real social agents, including humans, learn to cooperate from experience. In this paper, we analyze a reinforcement learning model called temporal difference learning and study its performance in the iterated Prisoner's Dilemma game. Temporal difference learning is unique among a variety of learning models in that it inherently aims at increasing future payoffs, not immediate ones. It also has a neural basis. We analytically and numerically show that learners with only two internal states properly learn to cooperate with retaliatory players and to defect against unconditional cooperators and defectors. Four-state learners are more capable of achieving a high payoff against various opponents. Moreover, we numerically show that four-state learners can learn to establish mutual cooperation for sufficiently small learning rates.
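The following is a minimal, illustrative sketch of a temporal difference learner with two internal states playing the iterated Prisoner's Dilemma against a retaliatory (Tit-for-Tat) opponent. It is not the exact model analyzed in the paper: it assumes a Q-learning-style TD update, hypothetical payoff values (T=5, R=3, P=1, S=0), illustrative parameter values (alpha, gamma, epsilon), and that the two internal states are the opponent's previous move.

```python
import random

# Hypothetical Prisoner's Dilemma payoffs (T > R > P > S); illustrative only.
PAYOFF = {('C', 'C'): 3, ('C', 'D'): 0, ('D', 'C'): 5, ('D', 'D'): 1}

ACTIONS = ('C', 'D')
ALPHA = 0.1    # learning rate (illustrative value)
GAMMA = 0.9    # discount factor weighting future payoffs
EPSILON = 0.1  # exploration probability

# Two internal states: the opponent's previous move ('C' or 'D').
Q = {(s, a): 0.0 for s in ACTIONS for a in ACTIONS}

def choose(state):
    """Epsilon-greedy choice between cooperation and defection."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def tit_for_tat(learner_last_move):
    """Retaliatory opponent: repeats the learner's previous move."""
    return learner_last_move

state = 'C'       # assume the opponent is treated as having cooperated initially
my_last = 'C'
for _ in range(10000):
    action = choose(state)
    opp_action = tit_for_tat(my_last)
    reward = PAYOFF[(action, opp_action)]
    next_state = opp_action  # next internal state = opponent's current move
    # Q-learning-style temporal difference update: bootstraps on future payoffs
    # rather than maximizing the immediate reward alone.
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
    state, my_last = next_state, action

print(Q)  # against Tit-for-Tat, the value of cooperating after cooperation should dominate
```

Under these assumptions, the discounted value of sustained mutual cooperation (R / (1 - gamma)) exceeds that of defecting and then suffering retaliation, which is why a TD learner that weights future payoffs can settle on cooperation against a retaliatory opponent.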