
Modular deep reinforcement learning from reward and punishment for robot navigation.

Affiliations

Department of Brain Robot Interface, ATR Computational Neuroscience Laboratories, 2-2-2 Hikaridai, Seikacho, Soraku-gun, Kyoto 619-0288, Japan.

ContextVision AB, Storgatan 24, 582 23 Linkoping, Sweden.

Publication Information

Neural Netw. 2021 Mar;135:115-126. doi: 10.1016/j.neunet.2020.12.001. Epub 2020 Dec 8.

Abstract

Modular Reinforcement Learning decomposes a monolithic task into several tasks with sub-goals and learns each one in parallel to solve the original problem. Such learning patterns can be traced in the brains of animals. Recent evidence in neuroscience shows that animals utilize separate systems for processing rewards and punishments, illuminating a different perspective for modularizing Reinforcement Learning tasks. MaxPain and its deep variant, Deep MaxPain, demonstrated the advantages of such a dichotomy-based decomposition architecture over conventional Q-learning in terms of safety and learning efficiency. These two methods differ in policy derivation: MaxPain linearly unified the reward and punishment value functions and generated a joint policy from the unified values, whereas Deep MaxPain tackled scaling problems in high-dimensional cases by linearly forming a joint policy from the two sub-policies derived from their respective value functions. However, the mixing weights in both methods were determined manually, leading to inadequate use of the learned modules. In this work, we discuss the signal scaling of reward and punishment in relation to the discount factor γ, and propose a weak constraint for signal design. To further exploit the learned models, we propose a state-value dependent weighting scheme that automatically tunes the mixing weights: hard-max and softmax, based on a case analysis of the Boltzmann distribution. We focus on maze-solving navigation tasks and investigate how the two objectives (pain-avoiding and goal-reaching) influence each other's behaviors during learning. We also propose a sensor fusion network structure that combines lidar with images captured by a monocular camera, instead of lidar-only or image-only sensing. Our results, both in simulations of three maze types of different complexity and in a real-robot experiment on an L-maze with a Turtlebot3 Waffle Pi, demonstrate the improvements achieved by our methods.
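As a rough illustration of the state-value dependent weighting idea described above, the sketch below mixes the policies of a goal-reaching (reward) module and a pain-avoiding (punishment) module using weights derived from the modules' own state values, with both the hard-max and the softmax (Boltzmann) variants. This is a minimal sketch, not the authors' implementation; the function names, the temperature parameter, and the sign convention for the pain value are assumptions for illustration.

```python
import numpy as np

def mix_policies(v_reward, v_pain, pi_reward, pi_pain, temperature=1.0, hard=False):
    """Combine two sub-policies into one joint policy.

    v_reward, v_pain : scalar state values of the reward and punishment modules
                       (v_pain is assumed to grow when more pain is expected,
                       hence it is negated below).
    pi_reward, pi_pain : action distributions (1-D arrays over the same actions).
    """
    # Higher score favors the reward module when goal progress looks good,
    # and the punishment module when pain is imminent.
    scores = np.array([v_reward, -v_pain], dtype=float)

    if hard:
        # Hard-max: all weight goes to the module with the larger score.
        weights = np.zeros(2)
        weights[np.argmax(scores)] = 1.0
    else:
        # Softmax (Boltzmann) weighting over the two state values.
        z = scores / temperature
        z -= z.max()                       # numerical stability
        weights = np.exp(z) / np.exp(z).sum()

    w_r, w_p = weights
    joint = w_r * np.asarray(pi_reward) + w_p * np.asarray(pi_pain)
    return joint / joint.sum()             # renormalize the mixed distribution

# Example: the pain-avoiding module dominates when a collision looks likely.
pi = mix_policies(v_reward=0.2, v_pain=0.9,
                  pi_reward=np.array([0.7, 0.2, 0.1]),
                  pi_pain=np.array([0.1, 0.1, 0.8]),
                  temperature=0.5)
print(pi)
```

In this reading, the weights are recomputed at every state from the learned value functions, which is what removes the need for the manually chosen mixing weights used in MaxPain and Deep MaxPain.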

