
Incorporating social payoff into reinforcement learning promotes cooperation.

Authors

Fan Litong, Song Zhao, Wang Lu, Liu Yang, Wang Zhen

Affiliations

School of Mechanical Engineering, Northwestern Polytechnical University, Xi'an, Shaanxi 710072, China.

School of Artificial Intelligence, Optics and Electronics (iOPEN), Northwestern Polytechnical University, Xi'an, Shaanxi 710072, China.

Publication

Chaos. 2022 Dec;32(12):123140. doi: 10.1063/5.0093996.

Abstract

Reinforcement learning has been demonstrated to be an effective approach for investigating the dynamics of strategy updating and the learning process of agents in game theory. Most studies have shown that Q-learning fails to resolve the problem of cooperation in well-mixed populations or homogeneous networks. To this end, we investigate the effect of self-regarding Q-learning on cooperation in spatial prisoner's dilemma games by incorporating the social payoff. Here, we redefine the reward term of self-regarding Q-learning to involve the social payoff; that is, the reward is defined as a monotonic function of the individual payoff and the social payoff, represented by the neighbors' payoff. Numerical simulations reveal that such a framework facilitates cooperation remarkably because the social payoff ensures that agents learn to cooperate toward socially optimal outcomes. Moreover, we find that self-regarding Q-learning is an innovative rule that ensures cooperators coexist with defectors even at high temptations to defect. An investigation of the emergence and stability of the sublattice-ordered structure shows that this mechanism tends to generate a checkerboard pattern that increases agents' payoffs. Finally, the effects of the Q-learning parameters are analyzed, and the robustness of the mechanism is verified on different networks.
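The abstract does not give the exact functional form of the modified reward, so the following is only a minimal sketch of the idea: agents on a periodic lattice play a weak prisoner's dilemma and run stateless Q-learning over two actions, with the reward taken as a weighted sum of the agent's own payoff and the mean payoff of its four neighbors. All parameter values (lattice size, temptation `B`, learning rate, weight `W`, etc.) and the linear mixing rule are assumptions for illustration, not the paper's specification.

```python
import random

# Sketch (assumed): self-regarding Q-learning on an L x L lattice where the
# reward mixes the individual payoff with the neighbors' mean payoff.
L = 10       # lattice side length (assumed)
B = 1.2      # temptation to defect (assumed)
ALPHA = 0.1  # Q-learning rate (assumed)
GAMMA = 0.9  # discount factor (assumed)
EPS = 0.02   # exploration probability (assumed)
W = 0.5      # weight on the social payoff (assumed)

C, D = 0, 1  # cooperate / defect

def pd_payoff(a, b):
    """Weak prisoner's dilemma: R = 1, P = S = 0, T = B."""
    if a == C:
        return 1.0 if b == C else 0.0
    return B if b == C else 0.0

def neighbors(i, j):
    """Von Neumann neighborhood with periodic boundaries."""
    return [((i - 1) % L, j), ((i + 1) % L, j),
            (i, (j - 1) % L), (i, (j + 1) % L)]

random.seed(0)
q = [[[0.0, 0.0] for _ in range(L)] for _ in range(L)]   # Q[i][j][action]
act = [[random.choice((C, D)) for _ in range(L)] for _ in range(L)]

def step():
    # Accumulated payoff of each agent against its four neighbors.
    pay = [[sum(pd_payoff(act[i][j], act[x][y]) for x, y in neighbors(i, j))
            for j in range(L)] for i in range(L)]
    # Update Q-values with the socially mixed reward (one simple monotonic
    # combination of individual and neighbor payoffs).
    for i in range(L):
        for j in range(L):
            social = sum(pay[x][y] for x, y in neighbors(i, j)) / 4.0
            reward = (1 - W) * pay[i][j] + W * social
            a = act[i][j]
            q[i][j][a] += ALPHA * (reward + GAMMA * max(q[i][j]) - q[i][j][a])
    # Epsilon-greedy action selection for the next round.
    for i in range(L):
        for j in range(L):
            if random.random() < EPS:
                act[i][j] = random.choice((C, D))
            else:
                act[i][j] = C if q[i][j][C] >= q[i][j][D] else D

for _ in range(200):
    step()

coop_frac = sum(a == C for row in act for a in row) / (L * L)
```

Under this simplified stateless variant, tracking `coop_frac` over rounds gives a rough view of how the social-payoff weight `W` shifts the population toward cooperation; the paper's full model (state definitions, exact reward function, and parameter ranges) should be consulted for the actual dynamics.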

