Tokyo Institute of Technology, Japan.
ATR Computational Neuroscience Labs, Japan.
Neural Netw. 2014 Sep;57:128-40. doi: 10.1016/j.neunet.2014.06.006. Epub 2014 Jun 21.
The goal of reinforcement learning (RL) is to have an agent learn an optimal control policy in an unknown environment so that expected future rewards are maximized. The model-free RL approach learns the policy directly from data samples. Although using many samples tends to improve the accuracy of policy learning, collecting a large number of samples is often expensive in practice. On the other hand, the model-based RL approach first estimates the transition model of the environment and then learns the policy based on the estimated transition model. Thus, if the transition model can be learned accurately from a small amount of data, the model-based approach is a promising alternative to the model-free approach. In this paper, we propose a novel model-based RL method by combining a recently proposed model-free policy search method called policy gradients with parameter-based exploration and a state-of-the-art transition model estimator called least-squares conditional density estimation. Through experiments, we demonstrate the practical usefulness of the proposed method.
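To make the described combination concrete, the following is a minimal sketch of model-based policy search in this spirit: a transition model is fitted to a small batch of real transitions, and a linear policy is then improved by parameter-space exploration (PGPE-style) on rollouts simulated with the learned model. The toy 1-D dynamics, the linear-Gaussian transition model (standing in for least-squares conditional density estimation), and all names and constants are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1: collect a small batch of real transitions (s, a, s') from a toy system.
def true_dynamics(s, a):
    return 0.9 * s + 0.5 * a + 0.1 * rng.standard_normal()

S, A, S_next = [], [], []
for _ in range(200):
    s, a = rng.uniform(-1, 1), rng.uniform(-1, 1)
    S.append(s); A.append(a); S_next.append(true_dynamics(s, a))
S, A, S_next = map(np.array, (S, A, S_next))

# Step 2: fit a simple transition model p(s'|s,a) from the samples.
# (Here a linear-Gaussian least-squares fit; the paper uses LSCDE instead.)
X = np.column_stack([S, A, np.ones_like(S)])
w, *_ = np.linalg.lstsq(X, S_next, rcond=None)
resid_std = np.std(S_next - X @ w)

def model_step(s, a):
    return w[0] * s + w[1] * a + w[2] + resid_std * rng.standard_normal()

# Step 3: PGPE-style policy search on the learned model.
# Deterministic linear policy a = theta * s; exploration happens in parameter
# space by sampling theta from a Gaussian hyper-distribution N(mu, sigma^2).
def rollout_return(theta, horizon=20):
    s, ret = 1.0, 0.0
    for _ in range(horizon):
        a = float(np.clip(theta * s, -1, 1))
        s = model_step(s, a)
        ret += -s ** 2          # reward: keep the state near zero
    return ret

mu, sigma = 0.0, 1.0
for it in range(200):
    thetas = mu + sigma * rng.standard_normal(20)
    returns = np.array([rollout_return(t) for t in thetas])
    b = returns.mean()          # baseline for variance reduction
    # Gradient of the expected return w.r.t. the hyper-parameters (mu, sigma).
    g_mu = np.mean((returns - b) * (thetas - mu) / sigma ** 2)
    g_sigma = np.mean((returns - b) * ((thetas - mu) ** 2 - sigma ** 2) / sigma ** 3)
    mu += 0.05 * g_mu
    sigma = max(0.1, sigma + 0.05 * g_sigma)

print("learned policy gain:", mu)
```

In this sketch, all policy updates use only rollouts from the fitted model, so the number of real environment samples stays fixed at the initial batch; this is the sample-efficiency argument the abstract makes for the model-based approach.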