IEEE Trans Cybern. 2015 Jan;45(1):77-88. doi: 10.1109/TCYB.2014.2319733. Epub 2014 May 13.
Reinforcement learning (RL) enables an agent to learn behavior by acquiring experience through trial-and-error interactions with a dynamic environment. However, knowledge is usually built from scratch, and learning to behave well may take a long time. Here, we improve learning performance by leveraging prior knowledge; that is, the learner exhibits appropriate behavior from the beginning of a target task by using knowledge from a set of known, previously solved source tasks. In this paper, we argue that building stochastic abstract policies that generalize over past experiences is an effective way to provide such improvement, and that this generalization outperforms the current practice of using a library of policies. We achieve this by contributing a new algorithm, AbsProb-PI-multiple, and a framework for transferring knowledge represented as a stochastic abstract policy to new RL tasks. Stochastic abstract policies offer an effective way to encode knowledge because the abstraction they provide not only generalizes solutions but also facilitates the extraction of similarities among tasks. We perform experiments in a robotic navigation environment, analyze the agent's behavior throughout the learning process, and assess the transfer ratio for different numbers of source tasks. We compare our method with the transfer of a library of policies, and the experiments show that using a generalized policy produces better results by more effectively guiding the agent while it learns a target task.
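To make the transferred knowledge concrete, the following is a minimal Python sketch of the general idea described above, not the paper's AbsProb-PI-multiple algorithm: a stochastic abstract policy maps abstract state descriptions to probability distributions over actions, is built from trajectories of previously solved source tasks, and is then used to bias action selection at the start of a target task. All names, the abstraction function, and the reuse scheme here are illustrative assumptions.

import random
from collections import defaultdict

# Hypothetical abstraction: reduce a ground state (robot position, goal position)
# to a coarse relational description, as one might do in a navigation task.
def abstract_state(ground_state):
    robot, goal = ground_state
    return ("goal_right" if goal[0] > robot[0] else "goal_left",
            "goal_up" if goal[1] > robot[1] else "goal_down")

# Build a stochastic abstract policy from several solved source tasks:
# for each abstract state, the empirical distribution over actions taken there.
def build_abstract_policy(source_trajectories):
    counts = defaultdict(lambda: defaultdict(int))
    for trajectory in source_trajectories:
        for ground_state, action in trajectory:
            counts[abstract_state(ground_state)][action] += 1
    policy = {}
    for s_abs, action_counts in counts.items():
        total = sum(action_counts.values())
        policy[s_abs] = {a: c / total for a, c in action_counts.items()}
    return policy

# While learning the target task, the transferred policy guides exploration:
# with probability reuse_prob, sample an action from the abstract policy;
# otherwise fall back to uniform exploration (the learner's own policy would
# gradually take over as it improves).
def choose_action(ground_state, abstract_policy, actions, reuse_prob=0.8):
    s_abs = abstract_state(ground_state)
    dist = abstract_policy.get(s_abs)
    if dist and random.random() < reuse_prob:
        acts, probs = zip(*dist.items())
        return random.choices(acts, weights=probs, k=1)[0]
    return random.choice(actions)

if __name__ == "__main__":
    # Two toy source "trajectories": ((robot_pos, goal_pos), action) pairs.
    sources = [
        [(((0, 0), (3, 3)), "up"), (((0, 1), (3, 3)), "right")],
        [(((1, 0), (4, 2)), "right"), (((2, 0), (4, 2)), "up")],
    ]
    pi_abs = build_abstract_policy(sources)
    print(choose_action(((0, 0), (5, 5)), pi_abs, ["up", "down", "left", "right"]))

Because the policy is indexed by abstract states rather than ground states, the same distribution can guide the agent in target tasks whose ground state spaces differ from those of the source tasks, which is what distinguishes this kind of generalized policy from reusing a library of ground-level policies.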