DeepMind, London, UK.
University College London, London, UK.
Nature. 2020 Dec;588(7839):604-609. doi: 10.1038/s41586-020-03051-4. Epub 2020 Dec 23.
Constructing agents with planning capabilities has long been one of the main challenges in the pursuit of artificial intelligence. Tree-based planning methods have enjoyed huge success in challenging domains, such as chess and Go, where a perfect simulator is available. However, in real-world problems, the dynamics governing the environment are often complex and unknown. Here we present the MuZero algorithm, which, by combining a tree-based search with a learned model, achieves superhuman performance in a range of challenging and visually complex domains, without any knowledge of their underlying dynamics. The MuZero algorithm learns a model that, when applied iteratively, produces the predictions most directly relevant to planning: the reward, the action-selection policy and the value function. When evaluated on 57 different Atari games (the canonical video game environment for testing artificial-intelligence techniques, in which model-based planning approaches have historically struggled), the MuZero algorithm achieved state-of-the-art performance. When evaluated on Go, chess and shogi (canonical environments for high-performance planning), the MuZero algorithm matched, without any knowledge of the game dynamics, the superhuman performance of the AlphaZero algorithm that was supplied with the rules of the game.
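To make the abstract's description of the learned model concrete, the following is a minimal, illustrative sketch of its interface, assuming the representation/dynamics/prediction decomposition used in the full paper. The toy linear layers, weight names and dimensions here are hypothetical stand-ins for MuZero's trained deep networks; it shows only how the model, applied iteratively in hidden-state space, yields the reward, policy and value used by the search.

import numpy as np

rng = np.random.default_rng(0)
HIDDEN, ACTIONS, OBS = 8, 4, 16

# Illustrative parameters standing in for trained network weights.
W_repr = rng.normal(size=(OBS, HIDDEN))              # observation -> hidden state
W_dyn = rng.normal(size=(HIDDEN + ACTIONS, HIDDEN))  # (state, action) -> next state
w_reward = rng.normal(size=HIDDEN)                   # state -> predicted reward
W_policy = rng.normal(size=(HIDDEN, ACTIONS))        # state -> policy logits
w_value = rng.normal(size=HIDDEN)                    # state -> value estimate

def representation(observation):
    # h: encode a raw observation into a hidden state with no imposed
    # environment semantics.
    return np.tanh(observation @ W_repr)

def dynamics(state, action):
    # g: advance the hidden state given an action, and predict the reward
    # of the transition, without simulating the true environment.
    one_hot = np.eye(ACTIONS)[action]
    next_state = np.tanh(np.concatenate([state, one_hot]) @ W_dyn)
    reward = float(next_state @ w_reward)
    return next_state, reward

def prediction(state):
    # f: predict the action-selection policy and the value from a hidden state.
    logits = state @ W_policy
    policy = np.exp(logits - logits.max())
    policy /= policy.sum()
    value = float(state @ w_value)
    return policy, value

# Applying the model iteratively, as the abstract describes: encode the
# observation once, then unroll entirely inside the learned hidden-state
# space. A tree-based search (e.g. MCTS) would branch over actions at each
# step rather than following this single greedy rollout.
state = representation(rng.normal(size=OBS))
for _ in range(3):
    policy, value = prediction(state)
    action = int(np.argmax(policy))
    state, reward = dynamics(state, action)
    print(f"action={action} reward={reward:+.3f} value={value:+.3f}")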