IEEE Trans Neural Netw Learn Syst. 2018 Jun;29(6):2080-2098. doi: 10.1109/TNNLS.2018.2812709.
Reinforcement learning in environments with many action-state pairs is challenging. The issue is the number of episodes needed to thoroughly search the policy space. Most conventional heuristics address this search problem in a stochastic manner, which can leave large portions of the policy space unvisited during the early training stages. In this paper, we propose an uncertainty-based, information-theoretic approach for performing guided stochastic searches that more effectively cover the policy space. Our approach is based on the value of information, a criterion that provides the optimal tradeoff between expected costs and the granularity of the search process. The value of information yields a stochastic routine for choosing actions during learning that can explore the policy space in a coarse-to-fine manner. We augment this criterion with a state-transition uncertainty factor, which guides the search process into previously unexplored regions of the policy space. We evaluate the uncertainty-based value-of-information policies on the games Centipede and Crossy Road. Our results indicate that our approach yields better-performing policies in fewer episodes than conventional stochastic exploration strategies. We show that the training rate for our approach can be further improved by using the policy cross entropy to guide our criterion's hyperparameter selection.
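The abstract does not spell out the action-selection rule, but value-of-information criteria are commonly realized as a prior-weighted Boltzmann distribution over Q-values, where a temperature-like hyperparameter controls the cost-versus-granularity tradeoff. The sketch below illustrates only that coarse-to-fine idea under this assumption; the function name, the uniform prior, and the specific annealing of `beta` are illustrative choices, not the paper's method.

```python
import numpy as np

def voi_policy(q_values, prior, beta):
    """Sketch of a value-of-information-style action distribution:
    a prior over actions reweighted by exponentiated Q-values.
    Large beta -> near-uniform (coarse) exploration;
    small beta -> concentrates on the greedy action (fine search).
    This is an assumed form, not the paper's exact criterion."""
    logits = np.log(prior) + q_values / beta
    logits -= logits.max()          # shift for numerical stability
    p = np.exp(logits)
    return p / p.sum()

# Three actions with a uniform prior over them.
q = np.array([1.0, 2.0, 0.5])
prior = np.ones(3) / 3

coarse = voi_policy(q, prior, beta=100.0)  # early training: near-uniform
fine = voi_policy(q, prior, beta=0.1)      # late training: near-greedy
```

Annealing `beta` downward over episodes would move the search from coarse, broad coverage of the policy space toward exploitation of high-value actions, matching the coarse-to-fine behavior the abstract describes.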