
Weak Human Preference Supervision for Deep Reinforcement Learning.

Author information

Cao Zehong, Wong KaiChiu, Lin Chin-Teng

Publication information

IEEE Trans Neural Netw Learn Syst. 2021 Dec;32(12):5369-5378. doi: 10.1109/TNNLS.2021.3084198. Epub 2021 Nov 30.

DOI: 10.1109/TNNLS.2021.3084198
PMID: 34101604
Abstract

The current reward learning from human preferences could be used to resolve complex reinforcement learning (RL) tasks without access to a reward function by defining a single fixed preference between pairs of trajectory segments. However, the judgment of preferences between trajectories is not dynamic and still requires human input over thousands of iterations. In this study, we proposed a weak human preference supervision framework, for which we developed a human preference scaling model that naturally reflects the human perception of the degree of weak choices between trajectories and established a human-demonstration estimator through supervised learning to generate the predicted preferences for reducing the number of human inputs. The proposed weak human preference supervision framework can effectively solve complex RL tasks and achieve higher cumulative rewards in simulated robot locomotion-MuJoCo games-relative to the single fixed human preferences. Furthermore, our established human-demonstration estimator requires human feedback only for less than 0.01% of the agent's interactions with the environment and significantly reduces the cost of human inputs by up to 30% compared with the existing approaches. To present the flexibility of our approach, we released a video (https://youtu.be/jQPe1OILT0M) showing comparisons of the behaviors of agents trained on different types of human input. We believe that our naturally inspired human preferences with weakly supervised learning are beneficial for precise reward learning and can be applied to state-of-the-art RL systems, such as human-autonomy teaming systems.
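To make the method concrete, here is a minimal sketch (Python/PyTorch) of the preference-based reward learning step the abstract describes. It is an illustrative reconstruction, not the authors' released code: the name RewardNet, the network sizes, and the Bradley-Terry cross-entropy loss follow the standard preference-comparison setup that this paper extends, and the soft label in [0, 1] stands in for the paper's human preference scaling model (a hard 0/0.5/1 label recovers the single-fixed-preference baseline).

import torch
import torch.nn as nn

class RewardNet(nn.Module):
    """Maps one (state, action) pair to a scalar reward estimate."""
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)

def preference_loss(reward_net: RewardNet, seg_a, seg_b, label: torch.Tensor):
    """Loss for one batch of trajectory-segment pairs.

    seg_a and seg_b are (obs, act) tensors of shape (batch, T, obs_dim)
    and (batch, T, act_dim). label is a soft preference in [0, 1]:
    1.0 means segment A is clearly preferred, 0.5 means indifference,
    and intermediate values encode the degree of a weak preference.
    """
    obs_a, act_a = seg_a
    obs_b, act_b = seg_b
    # Score each segment by summing the learned per-step rewards over T.
    score_a = reward_net(obs_a, act_a).sum(dim=-1)
    score_b = reward_net(obs_b, act_b).sum(dim=-1)
    # Bradley-Terry model: P(A preferred) = sigmoid(score_a - score_b).
    # Cross-entropy against the soft label accommodates weak preferences.
    return nn.functional.binary_cross_entropy_with_logits(
        score_a - score_b, label)

The paper's human-demonstration estimator would then be a separate supervised model fit to the accumulated (segment pair, label) data so that it, rather than a person, answers most preference queries; this is how the framework keeps real human feedback below 0.01% of the agent's environment interactions.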


Similar articles

1. Weak Human Preference Supervision for Deep Reinforcement Learning.
IEEE Trans Neural Netw Learn Syst. 2021 Dec;32(12):5369-5378. doi: 10.1109/TNNLS.2021.3084198. Epub 2021 Nov 30.
2. Human locomotion with reinforcement learning using bioinspired reward reshaping strategies.
Med Biol Eng Comput. 2021 Jan;59(1):243-256. doi: 10.1007/s11517-020-02309-3. Epub 2021 Jan 8.
3. Neuro-Inspired Reinforcement Learning to Improve Trajectory Prediction in Reward-Guided Behavior.
Int J Neural Syst. 2022 Sep;32(9):2250038. doi: 10.1142/S0129065722500381. Epub 2022 Aug 19.
4. Inertia-Constrained Reinforcement Learning to Enhance Human Motor Control Modeling.
Sensors (Basel). 2023 Mar 1;23(5):2698. doi: 10.3390/s23052698.
5. Nutrient-Sensitive Reinforcement Learning in Monkeys.
J Neurosci. 2023 Mar 8;43(10):1714-1730. doi: 10.1523/JNEUROSCI.0752-22.2022. Epub 2023 Jan 20.
6. Exploration in neo-Hebbian reinforcement learning: Computational approaches to the exploration-exploitation balance with bio-inspired neural networks.
Neural Netw. 2022 Jul;151:16-33. doi: 10.1016/j.neunet.2022.03.021. Epub 2022 Mar 23.
7. Combining STDP and binary networks for reinforcement learning from images and sparse rewards.
Neural Netw. 2021 Dec;144:496-506. doi: 10.1016/j.neunet.2021.09.010. Epub 2021 Sep 17.
8. Training an Actor-Critic Reinforcement Learning Controller for Arm Movement Using Human-Generated Rewards.
IEEE Trans Neural Syst Rehabil Eng. 2017 Oct;25(10):1892-1905. doi: 10.1109/TNSRE.2017.2700395. Epub 2017 May 2.
9. Selective particle attention: Rapidly and flexibly selecting features for deep reinforcement learning.
Neural Netw. 2022 Jun;150:408-421. doi: 10.1016/j.neunet.2022.03.015. Epub 2022 Mar 17.
10. Modular deep reinforcement learning from reward and punishment for robot navigation.
Neural Netw. 2021 Mar;135:115-126. doi: 10.1016/j.neunet.2020.12.001. Epub 2020 Dec 8.

Cited by

1. Large Language Models in Oncology: Revolution or Cause for Concern?
Curr Oncol. 2024 Mar 29;31(4):1817-1830. doi: 10.3390/curroncol31040137.
2. The Breakthrough of Large Language Models Release for Medical Applications: 1-Year Timeline and Perspectives.
J Med Syst. 2024 Feb 17;48(1):22. doi: 10.1007/s10916-024-02045-3.
3. IoT-Based Reinforcement Learning Using Probabilistic Model for Determining Extensive Exploration through Computational Intelligence for Next-Generation Techniques.
Comput Intell Neurosci. 2023 Oct 10;2023:5113417. doi: 10.1155/2023/5113417. eCollection 2023.
4. Deep feature selection using local search embedded social ski-driver optimization algorithm for breast cancer detection in mammograms.
Neural Comput Appl. 2023;35(7):5479-5499. doi: 10.1007/s00521-022-07895-x. Epub 2022 Nov 5.