
Weak Human Preference Supervision for Deep Reinforcement Learning

Authors

Cao Zehong, Wong KaiChiu, Lin Chin-Teng

Publication

IEEE Trans Neural Netw Learn Syst. 2021 Dec;32(12):5369-5378. doi: 10.1109/TNNLS.2021.3084198. Epub 2021 Nov 30.

Abstract

Reward learning from human preferences can be used to resolve complex reinforcement learning (RL) tasks without access to a reward function by defining a single fixed preference between pairs of trajectory segments. However, the judgment of preferences between trajectories is not dynamic and still requires human input over thousands of iterations. In this study, we propose a weak human preference supervision framework, for which we developed a human preference scaling model that naturally reflects the human perception of the degree of weak choices between trajectories, and we established a human-demonstration estimator through supervised learning that generates predicted preferences to reduce the number of human inputs. The proposed weak human preference supervision framework can effectively solve complex RL tasks and achieves higher cumulative rewards in simulated robot locomotion (MuJoCo) tasks relative to single fixed human preferences. Furthermore, our human-demonstration estimator requires human feedback for less than 0.01% of the agent's interactions with the environment and reduces the cost of human input by up to 30% compared with existing approaches. To demonstrate the flexibility of our approach, we released a video (https://youtu.be/jQPe1OILT0M) comparing the behaviors of agents trained on different types of human input. We believe that our naturally inspired human preferences with weakly supervised learning are beneficial for precise reward learning and can be applied to state-of-the-art RL systems, such as human-autonomy teaming systems.
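
To make the idea of "weak" (scaled) preference labels concrete, the following is a minimal, illustrative sketch of reward learning from preference pairs under a Bradley-Terry style model, where the human label is a continuous value mu in [0, 1] rather than a single fixed choice. This is not the authors' released code; the function and variable names (preference_probability, weak_preference_loss, returns_a, returns_b, mu) are hypothetical, and the loss shown is the generic scaled-label cross-entropy used in preference-based RL, assumed here to approximate the paper's preference scaling.

```python
# Hedged sketch: reward learning from scaled ("weak") preference labels.
# Not the authors' implementation; names and the exact loss form are assumptions.
import numpy as np

def preference_probability(returns_a: float, returns_b: float) -> float:
    """P[segment A preferred over segment B] under a Bradley-Terry model,
    where each return is the sum of the learned reward r_hat over a segment."""
    # Numerically stable two-way softmax over the segment returns.
    m = max(returns_a, returns_b)
    ea, eb = np.exp(returns_a - m), np.exp(returns_b - m)
    return ea / (ea + eb)

def weak_preference_loss(returns_a: float, returns_b: float, mu: float) -> float:
    """Cross-entropy between the predicted preference and a scaled human label
    mu in [0, 1]: mu = 1 means A is strongly preferred, mu = 0.5 means
    indifference, and intermediate values encode weak preferences (in contrast
    to a single fixed 0 / 0.5 / 1 label)."""
    p = preference_probability(returns_a, returns_b)
    eps = 1e-8  # guard against log(0)
    return -(mu * np.log(p + eps) + (1.0 - mu) * np.log(1.0 - p + eps))

if __name__ == "__main__":
    # Example: the human weakly prefers segment A (mu = 0.7), but the current
    # reward model rates segment B slightly higher, so the loss is large and
    # gradient updates would push r_hat toward the human's weak preference.
    print(weak_preference_loss(returns_a=1.0, returns_b=1.5, mu=0.7))
```

In a full training loop, the segment returns would be produced by a learned reward network, and predicted labels from a supervised demonstration estimator could stand in for mu on most pairs so that only a small fraction of queries require human feedback, in the spirit of the framework described above.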

