IEEE Trans Neural Netw Learn Syst. 2017 Aug;28(8):1814-1826. doi: 10.1109/TNNLS.2016.2543000. Epub 2016 May 4.
Learning from demonstrations is a paradigm by which an apprentice agent learns a control policy for a dynamic environment by observing demonstrations delivered by an expert agent. It is usually implemented as either imitation learning (IL) or inverse reinforcement learning (IRL) in the literature. On the one hand, IRL is a paradigm relying on Markov decision processes, in which the goal of the apprentice agent is to find, from the expert demonstrations, a reward function that could explain the expert's behavior. On the other hand, IL consists of directly generalizing the expert strategy, observed in the demonstrations, to unvisited states (it is therefore close to classification when there is a finite set of possible decisions). While these two views are often considered to be opposed to each other, the purpose of this paper is to exhibit a formal link between the two approaches from which new algorithms can be derived. We show that IL and IRL can be redefined so that they are equivalent, in the sense that there exists an explicit bijective operator (namely, the inverse optimal Bellman operator) between their respective spaces of solutions. To do so, we introduce the set-policy framework, which creates a clear link between IL and IRL. As a result, IL and IRL solutions that make the best of both worlds are obtained. In addition, it is a unifying framework from which existing IL and IRL algorithms can be derived and which opens the way for IL methods able to take the environment's dynamics into account. Finally, the IRL algorithms derived from the set-policy framework are compared with algorithms belonging to the more common trajectory-matching family. Experiments demonstrate that the set-policy-based algorithms outperform both standard IRL and IL algorithms and result in more robust solutions.
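As a point of reference (not part of the abstract), the inverse optimal Bellman operator mentioned above can be sketched with the standard MDP formalism; the notation used here (transition kernel P, discount factor gamma, action-value function Q) is an assumption for illustration, not taken from the text. The optimal Bellman operator associated with a reward R is

    (T*_R Q)(s,a) = R(s,a) + \gamma \sum_{s'} P(s' \mid s,a) \max_{a'} Q(s',a'),

and its inverse maps an action-value function Q back to the reward under which Q is the optimal action-value function:

    (J Q)(s,a) = Q(s,a) - \gamma \sum_{s'} P(s' \mid s,a) \max_{a'} Q(s',a'),   so that   (T*_{JQ} Q) = Q.

Under this reading, an IL method that learns a score function Q whose greedy policy imitates the expert implicitly defines, through J, a reward for which that policy is optimal, which is the kind of bijection between the IL and IRL solution spaces the abstract refers to.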