School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China.
CRISE, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China.
Sensors (Basel). 2021 May 11;21(10):3332. doi: 10.3390/s21103332.
StarCraft is a real-time strategy game that provides a complex environment for AI research. Macromanagement, i.e., selecting appropriate units to build depending on the current state, is one of the most important problems in this game. To reduce the requirement for expert knowledge and to enhance the coordination among the bot's modules, we use reinforcement learning (RL) to tackle the problem of macromanagement. We propose a novel deep RL method, Mean Asynchronous Advantage Actor-Critic (MA3C), which computes the approximate expected policy gradient instead of the gradient of a single sampled action to reduce the variance of the gradient estimate, and encodes the history queue with a recurrent neural network to handle imperfect information. The experimental results show that MA3C achieves a very high win rate of approximately 90% against weaker opponents and improves the win rate by about 30% against stronger opponents. We also propose a novel method to visualize and interpret the policy learned by MA3C. Combining the visualized results with snapshots of games, we find that the learned macromanagement not only adapts to the game rules and to the opponent bot's policy, but also cooperates well with the other modules of MA3C-Bot.
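The abstract does not spell out the update rule, but the contrast it draws, an expected policy gradient averaged over all actions rather than the gradient of the log-probability of one sampled action, can be illustrated with a minimal PyTorch sketch. The function name, tensor shapes, and the entropy coefficient below are illustrative assumptions for a discrete action space, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def expected_policy_gradient_loss(logits, q_values, entropy_coef=0.01):
    """Sketch of a mean (expected) actor loss: weight the advantage of
    EVERY action by pi(a|s), instead of using the log-probability of a
    single sampled action. Averaging over the action dimension removes
    the sampling variance that a sampled-action estimator incurs.

    logits:   (batch, num_actions) raw policy-network outputs
    q_values: (batch, num_actions) critic estimates Q(s, a)
    """
    probs = F.softmax(logits, dim=-1)
    log_probs = F.log_softmax(logits, dim=-1)

    # Baseline V(s) = sum_a pi(a|s) Q(s, a); advantages are treated as
    # constants with respect to the actor parameters (detached).
    v = (probs * q_values).sum(dim=-1, keepdim=True)
    advantages = (q_values - v).detach()

    # The gradient of sum_a pi(a|s) * A(s, a) (with A detached) equals
    # the expected policy gradient sum_a pi(a|s) grad log pi(a|s) A(s, a).
    actor_loss = -(probs * advantages).sum(dim=-1).mean()

    # Entropy bonus, as is conventional in A3C-style methods (assumed
    # here; the paper may weight or omit it differently).
    entropy = -(probs * log_probs).sum(dim=-1).mean()
    return actor_loss - entropy_coef * entropy
```

By contrast, a standard sampled-action actor loss would be `-log_probs.gather(-1, action) * advantage` for one sampled `action` per state; the expected form trades that single-sample estimate for a full sum over the (small, discrete) set of build actions.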