Department of Computer Science, Tokyo Institute of Technology, 2-12-1 O-okayama, Meguro-ku, Tokyo 152-8552, Japan.
Neural Netw. 2010 Jun;23(5):639-48. doi: 10.1016/j.neunet.2009.12.010. Epub 2010 Jan 11.
Appropriately designing sampling policies is important for obtaining good control policies in reinforcement learning. In this paper, we first show that the least-squares policy iteration (LSPI) framework allows us to employ statistical active learning methods for linear regression. We then propose a method for designing good sampling policies for efficient exploration, which is particularly useful when the sampling cost of immediate rewards is high. The effectiveness of the proposed method, which we call active policy iteration (API), is demonstrated through simulations with a batting robot.
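The LSPI framework referenced above evaluates a policy by fitting a linear Q-function with least squares (LSTD-Q) and then improving the policy greedily. A minimal sketch of one least-squares policy-evaluation step is shown below; the feature map, policy, and data are illustrative assumptions, not details from the paper.

```python
import numpy as np

def lstdq(transitions, phi, policy, gamma=0.95, reg=1e-6):
    """Estimate linear Q-function weights w with Q(s, a) ~ phi(s, a) . w
    from sampled (s, a, r, s') transitions under the given policy.
    This is the LSTD-Q evaluation step at the core of LSPI (sketch)."""
    k = phi(*transitions[0][:2]).shape[0]
    A = reg * np.eye(k)  # small ridge term keeps A invertible
    b = np.zeros(k)
    for s, a, r, s_next in transitions:
        f = phi(s, a)
        f_next = phi(s_next, policy(s_next))
        A += np.outer(f, f - gamma * f_next)
        b += r * f
    return np.linalg.solve(A, b)

# Tiny 2-state, 2-action example with one-hot features over (s, a) pairs
# (all values are hypothetical, for illustration only).
phi = lambda s, a: np.eye(4)[2 * s + a]
policy = lambda s: 0  # fixed policy, for one evaluation step
data = [(0, 0, 1.0, 1), (1, 0, 0.0, 0), (0, 1, 0.5, 0), (1, 1, 0.2, 1)]
w = lstdq(data, phi, policy)
print(w.shape)
```

In full LSPI the policy would be updated greedily with respect to the fitted Q-function and the evaluation step repeated; the paper's contribution is choosing the sampling policy that generates `data` so that this regression step becomes statistically efficient.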