Navarro-Guerrero Nicolás, Lowe Robert J, Wermter Stefan
Knowledge Technology, Informatics Department, University of Hamburg, Hamburg, Germany.
Division of Cognition and Communication, Department of Applied IT, University of Gothenburg, Gothenburg, Sweden.
Front Neurorobot. 2017 Apr 3;11:10. doi: 10.3389/fnbot.2017.00010. eCollection 2017.
Both nociception and punishment signals have been used in robotics. However, the potential of these negatively valenced types of reinforcement learning signals for robot learning has not yet been exploited in detail. Nociceptive signals are primarily used as triggers of preprogrammed action sequences. Punishment signals are typically disembodied, i.e., they bear little or no relation to the agent's intrinsic limitations, and they are often used to impose behavioral constraints. Here, we provide an alternative approach in which nociceptive signals act as drivers of learning rather than simple triggers of preprogrammed behavior. Specifically, we use nociception to expand the state space, while we use punishment as a negative reinforcement learning signal. We compare the performance (in terms of task error, amount of perceived nociception, and length of learned action sequences) of different neural networks imbued with punishment-based reinforcement signals for inverse kinematic learning. We contrast the performance of a version of the neural network that receives nociceptive inputs with that of a version that does not. Furthermore, we provide evidence that nociception can improve learning, making the algorithm more robust against network initializations, as well as behavioral performance, by reducing the task error, perceived nociception, and length of learned action sequences. Moreover, we provide evidence that punishment, at least as typically used within reinforcement learning applications, may be detrimental in all relevant metrics.
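The distinction drawn in the abstract, nociception as an extra input that expands the state space versus punishment as a negative term in the reinforcement signal, can be illustrated with a small sketch. This is not the authors' implementation: the planar two-link arm, the joint limits, the margin at which the nociceptive signal starts, and every function and constant name below are hypothetical choices made only for illustration.

```python
import numpy as np

# All names and constants are illustrative assumptions, not taken from the paper.
JOINT_LIMITS = np.array([[-np.pi / 2, np.pi / 2],
                         [0.0, 3 * np.pi / 4]])  # hypothetical (low, high) per joint, in radians
LINK_LENGTHS = np.array([1.0, 1.0])              # hypothetical link lengths
NOCI_MARGIN = 0.2                                # distance from a limit (radians) where the signal starts


def nociception(angles):
    """Per-joint signal in [0, 1] that rises as a joint nears or passes a limit."""
    low, high = JOINT_LIMITS[:, 0], JOINT_LIMITS[:, 1]
    dist_to_limit = np.minimum(angles - low, high - angles)
    return np.clip(1.0 - dist_to_limit / NOCI_MARGIN, 0.0, 1.0)


def forward_kinematics(angles):
    """End-effector position of a planar two-link arm."""
    a1, a12 = angles[0], angles[0] + angles[1]
    x = LINK_LENGTHS[0] * np.cos(a1) + LINK_LENGTHS[1] * np.cos(a12)
    y = LINK_LENGTHS[0] * np.sin(a1) + LINK_LENGTHS[1] * np.sin(a12)
    return np.array([x, y])


def build_state(angles, target, use_nociception):
    """State vector for the learner; nociception adds one extra input per joint."""
    base = np.concatenate([angles, target])
    if use_nociception:
        return np.concatenate([base, nociception(angles)])
    return base


def reward(angles, target, punishment_weight):
    """Negative task error, optionally reduced further by a punishment term."""
    task_error = np.linalg.norm(forward_kinematics(angles) - target)
    return -task_error - punishment_weight * nociception(angles).sum()
```

In this sketch, calling `build_state(..., use_nociception=True)` corresponds to the network variant that receives nociceptive inputs, while a nonzero `punishment_weight` corresponds to a punishment-based reinforcement signal. The two switches are independent, so the conditions compared in the abstract can be toggled separately.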