Sorbonne Université, CNRS, Institut des Systèmes Intelligents et de Robotique, ISIR, Paris, France.
Institut Jean Nicod, Département d'Études Cognitives, École Normale Supérieure, Paris, France.
PLoS One. 2022 Apr 26;17(4):e0266841. doi: 10.1371/journal.pone.0266841. eCollection 2022.
This paper focuses on a class of reinforcement learning problems where significant events are rare and limited to a single positive reward per episode. A typical example is that of an agent who has to choose a partner to cooperate with, while a large number of partners are simply not interested in cooperating, regardless of what the agent has to offer. We address this problem in a continuous state and action space with two different kinds of search methods: a policy gradient method and a direct policy search method using an evolution strategy. We show that when significant events are rare, gradient information is also scarce, making it difficult for policy gradient methods to find an optimal policy, with or without a deep neural architecture. On the other hand, we show that direct policy search methods are invariant to the rarity of significant events, which is yet another confirmation of the unique role evolutionary algorithms have to play as a reinforcement learning method.
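The contrast described in the abstract can be illustrated with a minimal toy sketch (not the paper's code; all names, constants, and the task below are illustrative assumptions): a REINFORCE-style policy gradient update is identically zero on every episode whose return is zero, so rare rewards leave most updates uninformative, whereas a rank-based evolution strategy only needs the ordering of candidate fitnesses over whole episodes.

```python
# Minimal sketch (illustrative, not the paper's implementation): contrasts a
# REINFORCE-style policy gradient update with a simple rank-based evolution
# strategy on a toy task where the only non-zero reward is rare.
import numpy as np

rng = np.random.default_rng(0)

def episode_return(action, target=2.0, width=0.05):
    # "Significant event": reward 1 only if the action lands in a narrow
    # interval around the target; otherwise the episode return is exactly 0.
    return 1.0 if abs(action - target) < width else 0.0

# --- Policy gradient (REINFORCE) on a Gaussian policy a ~ N(mu, sigma^2) ---
mu, sigma, lr = 0.0, 1.0, 0.1
for step in range(200):
    a = rng.normal(mu, sigma)
    R = episode_return(a)
    # Gradient of log N(a; mu, sigma) w.r.t. mu is (a - mu) / sigma^2.
    # Whenever R == 0 (the vast majority of episodes) this update is zero,
    # which is the "scarce gradient information" issue described above.
    mu += lr * R * (a - mu) / sigma**2

# --- Direct policy search: rank-based (mu/mu, lambda)-ES on the policy mean ---
theta, step_size, pop = 0.0, 1.0, 16
for gen in range(200):
    noise = rng.normal(size=pop)
    candidates = theta + step_size * noise
    fitness = np.array([episode_return(c) for c in candidates])
    # Truncation selection depends only on the ordering of fitness values,
    # so a single successful candidate in the population is enough to pull
    # the search distribution toward the rewarded region.
    elite = candidates[np.argsort(fitness)[-pop // 4:]]
    theta = elite.mean()

print(f"REINFORCE mean after training: {mu:.3f}")
print(f"ES mean after training:        {theta:.3f}")
```

The sketch is deliberately one-dimensional and omits step-size adaptation; it is only meant to make the abstract's argument concrete, not to reproduce the paper's experiments.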