Sakaguchi Yutaka, Takano Mitsuo
Graduate School of Information Systems, University of Electro-Communications, 1-5-1, Chofugaoka, Chofu, Tokyo 182-8585, Japan.
Neural Netw. 2004 Sep;17(7):935-52. doi: 10.1016/j.neunet.2004.05.004.
This article proposes an adaptive action-selection method for a model-free reinforcement learning system, based on the concept of the 'reliability of internal prediction/estimation'. This concept is realized through an internal variable, called the Reliability Index (RI), which estimates the accuracy of the internal estimator. We define this index for the value function of a temporal-difference learning system and substitute it for the temperature parameter of the Boltzmann action-selection rule. Accordingly, the weight of exploratory actions changes adaptively with the uncertainty of the prediction. We apply this idea to both tabular and weighted-sum value functions. Moreover, we use the RI to adjust the learning coefficient in addition to the temperature parameter, so that the reliability serves as a general basis for meta-learning. Numerical experiments were performed to examine the behavior of the proposed method. The RI-based Q-learning system demonstrated its characteristic behavior when an adaptive learning coefficient and a large RI-discount rate (which indicates how strongly the RI values of future states are reflected in the RI value of the current state) were introduced. Statistical tests confirmed that the algorithm spent more time on exploration in the early phase of learning but learned faster from the middle phase onward. It was also shown that the proposed method does not work well with actor-critic models. Finally, the limitations of the proposed method and its relationship to related research are discussed.
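As a concrete illustration, the following is a minimal Python sketch of RI-modulated Boltzmann action selection for a tabular Q-learning system. The abstract does not give the RI update rule, so the rule below (tracking the magnitude of TD errors, with the RI of successor states folded in through an assumed ri_discount parameter) and the RI-based learning coefficient alpha are illustrative assumptions, not the authors' formulas.

import numpy as np

# Minimal sketch (not the authors' exact formulas) of RI-modulated
# Boltzmann action selection for tabular Q-learning.  The RI update
# below -- moving toward |TD error| plus the discounted RI of the
# successor state -- is an assumption consistent with, but not taken
# from, the abstract's description of the RI-discount rate.

n_states, n_actions = 16, 4
rng = np.random.default_rng(0)

Q = np.zeros((n_states, n_actions))   # tabular value function
RI = np.ones(n_states)                # Reliability Index per state; large = unreliable
gamma = 0.95                          # reward discount rate
ri_discount = 0.9                     # assumed RI-discount rate
beta_ri = 0.1                         # assumed step size of the RI update

def select_action(s):
    # Boltzmann rule with the RI substituted for the temperature:
    # low reliability (large RI) -> high temperature -> more exploration.
    temperature = max(RI[s], 1e-3)
    logits = Q[s] / temperature
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return rng.choice(n_actions, p=p)

def update(s, a, r, s_next):
    # One Q-learning step; both the temperature (above) and the
    # learning coefficient (here) are modulated by the RI -- the
    # "meta-learning" use of the reliability described in the abstract.
    td_error = r + gamma * Q[s_next].max() - Q[s, a]
    alpha = min(1.0, RI[s])           # assumed RI-based learning coefficient
    Q[s, a] += alpha * td_error
    target = abs(td_error) + ri_discount * RI[s_next]
    RI[s] += beta_ri * (target - RI[s])

With this construction, the RI starts large (much exploration, fast learning) and shrinks as TD errors diminish, reproducing the qualitative behavior reported in the abstract: more exploration early on and faster, more exploitative learning later.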