Zoran Tiganj, Samuel J. Gershman, Per B. Sederberg, Marc W. Howard
Center for Memory and Brain, Department of Psychological and Brain Sciences, Boston University, Boston, MA 02215, U.S.A.
Department of Psychology and Center for Brain Science, Harvard University, Cambridge, MA 02138, U.S.A.
Neural Comput. 2019 Apr;31(4):681-709. doi: 10.1162/neco_a_01171. Epub 2019 Feb 14.
Natural learners must compute an estimate of future outcomes that follow from a stimulus in continuous time. Widely used reinforcement learning algorithms discretize continuous time and estimate either transition functions from one step to the next (model-based algorithms) or a scalar value of exponentially discounted future reward using the Bellman equation (model-free algorithms). An important drawback of model-based algorithms is that computational cost grows linearly with the amount of time to be simulated. An important drawback of model-free algorithms is the need to commit to a single timescale for exponential discounting. We present a computational mechanism, grounded in work from psychology and neuroscience, for computing a scale-invariant timeline of future outcomes. This mechanism efficiently computes an estimate of inputs as a function of future time on a logarithmically compressed scale and can be used to generate a scale-invariant power-law-discounted estimate of expected future reward. The representation of future time retains information about what will happen when. The entire timeline can be constructed in a single parallel operation that generates concrete behavioral and neural predictions. This computational mechanism could be incorporated into future reinforcement learning algorithms.
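A minimal numerical sketch (not the paper's implementation; the function name and parameter values below are illustrative assumptions) of why a logarithmically compressed set of timescales yields power-law rather than exponential discounting: since \int_0^\infty e^{-st} ds = 1/t, a mixture of exponential discounts over logarithmically spaced decay rates s approximates a 1/t power law across the covered range of times.

```python
import numpy as np

def power_law_discount(t, n_scales=60, s_min=1e-3, s_max=1e3):
    """Approximate a 1/t power-law discount as a mixture of exponential
    discounts exp(-s*t) over logarithmically spaced decay rates s.
    Rationale: integral_0^inf exp(-s*t) ds = 1/t, and with rates spaced
    uniformly in log s the Riemann sum
        sum_i s_i * exp(-s_i * t) * dln_s
    converges to 1/t for times 1/s_max << t << 1/s_min.
    (Illustrative sketch; parameter choices are assumptions.)"""
    s = np.geomspace(s_min, s_max, n_scales)   # log-spaced decay rates
    dln_s = np.log(s[1] / s[0])                # uniform spacing in log s
    t = np.atleast_1d(np.asarray(t, dtype=float))
    return (s[None, :] * np.exp(-np.outer(t, s))).sum(axis=1) * dln_s

# The mixture tracks 1/t across several orders of magnitude without
# committing to any single discounting timescale:
ts = np.array([0.1, 1.0, 10.0, 100.0])
print(power_law_discount(ts))  # approximately [10, 1, 0.1, 0.01]
print(1.0 / ts)                # exact power law for comparison
```

Outside the covered range of timescales the mixture flattens back toward exponential behavior, which is the finite-resource analogue of choosing a range of timescales rather than a single one.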