Sandholm T W, Crites R H
Computer Science Department, University of Massachusetts at Amherst 01003, USA.
Biosystems. 1996;37(1-2):147-66. doi: 10.1016/0303-2647(95)01551-5.
Reinforcement learning (RL) is based on the idea that the tendency to produce an action should be strengthened (reinforced) if it produces favorable results, and weakened if it produces unfavorable results. Q-learning is a recent RL algorithm that does not need a model of its environment and can be used on-line. Therefore, it is well suited for use in repeated games against an unknown opponent. Most RL research has been confined to single-agent settings or to multiagent settings where the agents have totally positively correlated payoffs (team problems) or totally negatively correlated payoffs (zero-sum games). This paper is an empirical study of reinforcement learning in the Iterated Prisoner's Dilemma (IPD), where the agents' payoffs are neither totally positively nor totally negatively correlated. RL is considerably more difficult in such a domain. This paper investigates the ability of a variety of Q-learning agents to play the IPD game against an unknown opponent. In some experiments, the opponent is the fixed strategy Tit-For-Tat, while in others it is another Q-learner. All the Q-learners learned to play optimally against Tit-For-Tat. Playing against another learner was more difficult because the adaptation of the other learner created a non-stationary environment, and because the other learner was not endowed with any a priori knowledge about the IPD game such as a policy designed to encourage cooperation. The learners that were studied varied along three dimensions: the length of history they received as context, the type of memory they employed (lookup tables based on restricted history windows or recurrent neural networks that can theoretically store features from arbitrarily deep in the past), and the exploration schedule they followed. Although all the learners faced difficulties when playing against other learners, agents with longer history windows, lookup table memories, and longer exploration schedules fared best in the IPD games.
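To make the setup described above concrete, the following is a minimal sketch of a tabular Q-learning agent playing the Iterated Prisoner's Dilemma against the fixed strategy Tit-For-Tat. The payoff values, learning rate, discount factor, exploration schedule, and one-step history window are illustrative assumptions, not the exact parameters used in the paper.

```python
import random
from collections import defaultdict

# Standard PD payoff matrix for the row player; the specific values are
# an assumption here, not necessarily those used in the paper.
PAYOFF = {('C', 'C'): 3, ('C', 'D'): 0, ('D', 'C'): 5, ('D', 'D'): 1}
ACTIONS = ('C', 'D')


def tit_for_tat(opponent_last):
    """Cooperate on the first move, then repeat the opponent's previous move."""
    return opponent_last if opponent_last is not None else 'C'


class TabularQLearner:
    """Q-learner whose state is a fixed-length window of recent joint moves."""

    def __init__(self, history_len=1, alpha=0.1, gamma=0.9,
                 eps_start=0.5, eps_decay=0.999):
        self.history_len = history_len
        self.alpha, self.gamma = alpha, gamma
        self.eps, self.eps_decay = eps_start, eps_decay
        self.q = defaultdict(float)      # lookup table: (state, action) -> value
        self.history = tuple()           # sequence of past (own move, opponent move)

    def state(self):
        # Restricted history window used as the Q-table state.
        return self.history[-self.history_len:]

    def act(self):
        # Epsilon-greedy exploration; epsilon is annealed in update().
        if random.random() < self.eps:
            return random.choice(ACTIONS)
        s = self.state()
        return max(ACTIONS, key=lambda a: self.q[(s, a)])

    def update(self, s, a, reward, s_next):
        # Standard one-step Q-learning backup.
        best_next = max(self.q[(s_next, b)] for b in ACTIONS)
        self.q[(s, a)] += self.alpha * (reward + self.gamma * best_next - self.q[(s, a)])
        self.eps *= self.eps_decay       # exploration schedule: gradual decay


def play_ipd(rounds=10_000):
    """Run the repeated game and return the learner's average payoff per round."""
    agent = TabularQLearner()
    my_last = None
    total = 0
    for _ in range(rounds):
        s = agent.state()
        a = agent.act()
        opp = tit_for_tat(my_last)
        reward = PAYOFF[(a, opp)]
        total += reward
        agent.history += ((a, opp),)
        agent.update(s, a, reward, agent.state())
        my_last = a
    return total / rounds


if __name__ == '__main__':
    print(f"average payoff per round vs Tit-For-Tat: {play_ipd():.2f}")
```

Replacing the fixed Tit-For-Tat opponent with a second TabularQLearner reproduces the harder, non-stationary setting discussed in the abstract; replacing the lookup table with a recurrent network would correspond to the paper's second memory type.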