
An immediate-return reinforcement learning for the atypical Markov decision processes.

Authors

Pan Zebang, Wen Guilin, Tan Zhao, Yin Shan, Hu Xiaoyan

Affiliations

State Key Laboratory of Advanced Design and Manufacturing for Vehicle Body, Hunan University, Changsha, Hunan, China.

School of Mechanical Engineering, Yanshan University, Qinhuangdao, Hebei, China.

Publication

Front Neurorobot. 2022 Dec 13;16:1012427. doi: 10.3389/fnbot.2022.1012427. eCollection 2022.

DOI: 10.3389/fnbot.2022.1012427
PMID: 36582302
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9793950/
Abstract

The atypical Markov decision processes (MDPs) are decision-making for maximizing the immediate returns in only one state transition. Many complex dynamic problems can be regarded as the atypical MDPs, e.g., football trajectory control, approximations of the compound Poincaré maps, and parameter identification. However, existing deep reinforcement learning (RL) algorithms are designed to maximize long-term returns, causing a waste of computing resources when applied in the atypical MDPs. These existing algorithms are also limited by the estimation error of the value function, leading to a poor policy. To solve such limitations, this paper proposes an immediate-return algorithm for the atypical MDPs with continuous action space by designing an unbiased and low variance target Q-value and a simplified network framework. Then, two examples of atypical MDPs considering the uncertainty are presented to illustrate the performance of the proposed algorithm, i.e., passing the football to a moving player and chipping the football over the human wall. Compared with the existing deep RL algorithms, such as deep deterministic policy gradient and proximal policy optimization, the proposed algorithm shows significant advantages in learning efficiency, the effective rate of control, and computing resource usage.

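The abstract's key design choice can be illustrated with a minimal sketch (not the paper's implementation): in a one-step MDP the episode ends after a single transition, so the target Q-value is just the observed immediate reward r, with no bootstrapped gamma * max Q(s') term, and critic learning collapses to ordinary regression. The toy environment, feature map, and closed-form greedy policy below are assumptions for demonstration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy atypical MDP: observe a state s in [0, 1], choose one
# continuous action a, receive a single noisy reward, and the episode ends.
# The reward peaks at a = 2s (assumed dynamics, for illustration only).
def env(s, a):
    return -(a - 2.0 * s) ** 2 + 0.1 * rng.normal(size=s.shape)

# Exploration data: uniformly random states and actions.
s = rng.uniform(0.0, 1.0, size=5000)
a = rng.uniform(-1.0, 3.0, size=5000)
r = env(s, a)

# With only one transition per episode, the target Q-value is simply the
# observed immediate reward r (no bootstrapping, hence no bootstrapping
# bias), so fitting the critic is plain least-squares regression of r on
# features of (s, a).
X = np.stack([np.ones_like(s), s, s * s, a, s * a, a * a], axis=1)
q_w, *_ = np.linalg.lstsq(X, r, rcond=None)

# Greedy continuous policy: maximize the critic, which is concave in a
# here, in closed form: a*(s) = -(w3 + w4 * s) / (2 * w5).
def policy(s):
    return -(q_w[3] + q_w[4] * s) / (2.0 * q_w[5])

print(policy(0.5))  # should be close to 1.0, since a*(s) = 2s
```

This is the sense in which long-horizon machinery in algorithms such as DDPG or PPO (replay of multi-step returns, bootstrapped targets, a discount factor) is wasted in this setting: the unbiased target is available directly from each single transition.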

Figures 1-12: available at PMC (article PMC9793950, images fnbot-16-1012427-g0001 to g0012).

Similar Articles

1. An immediate-return reinforcement learning for the atypical Markov decision processes. Front Neurorobot. 2022 Dec 13;16:1012427. doi: 10.3389/fnbot.2022.1012427. eCollection 2022.
2. Optimization of news dissemination push mode by intelligent edge computing technology for deep learning. Sci Rep. 2024 Mar 20;14(1):6671. doi: 10.1038/s41598-024-53859-7.
3. Sample Efficient Deep Reinforcement Learning With Online State Abstraction and Causal Transformer Model Prediction. IEEE Trans Neural Netw Learn Syst. 2024 Nov;35(11):16574-16588. doi: 10.1109/TNNLS.2023.3296642. Epub 2024 Oct 29.
4. Kernel-based least squares policy iteration for reinforcement learning. IEEE Trans Neural Netw. 2007 Jul;18(4):973-92. doi: 10.1109/TNN.2007.899161.
5. Parameterized MDPs and Reinforcement Learning Problems-A Maximum Entropy Principle-Based Framework. IEEE Trans Cybern. 2022 Sep;52(9):9339-9351. doi: 10.1109/TCYB.2021.3102510. Epub 2022 Aug 18.
6. Optimization of anemia treatment in hemodialysis patients via reinforcement learning. Artif Intell Med. 2014 Sep;62(1):47-60. doi: 10.1016/j.artmed.2014.07.004. Epub 2014 Jul 19.
7. Joint Optimization for Mobile Edge Computing-Enabled Blockchain Systems: A Deep Reinforcement Learning Approach. Sensors (Basel). 2022 Apr 22;22(9):3217. doi: 10.3390/s22093217.
8. Hierarchical approximate policy iteration with binary-tree state space decomposition. IEEE Trans Neural Netw. 2011 Dec;22(12):1863-77. doi: 10.1109/TNN.2011.2168422. Epub 2011 Oct 10.
9. Learning-Based DoS Attack Power Allocation in Multiprocess Systems. IEEE Trans Neural Netw Learn Syst. 2023 Oct;34(10):8017-8030. doi: 10.1109/TNNLS.2022.3148924. Epub 2023 Oct 5.
10. On Practical Robust Reinforcement Learning: Adjacent Uncertainty Set and Double-Agent Algorithm. IEEE Trans Neural Netw Learn Syst. 2025 Apr;36(4):7696-7710. doi: 10.1109/TNNLS.2024.3385234. Epub 2025 Apr 4.

References Cited in This Article

1. Model-Based and Model-Free Replay Mechanisms for Reinforcement Learning in Neurorobotics. Front Neurorobot. 2022 Jun 24;16:864380. doi: 10.3389/fnbot.2022.864380. eCollection 2022.
2. Deep Reinforcement Learning Based Trajectory Planning Under Uncertain Constraints. Front Neurorobot. 2022 May 2;16:883562. doi: 10.3389/fnbot.2022.883562. eCollection 2022.
3. Maximal Sprinting Speed of Elite Soccer Players During Training and Matches. J Strength Cond Res. 2017 Jun;31(6):1509-1517. doi: 10.1519/JSC.0000000000001642.
4. Human-level control through deep reinforcement learning. Nature. 2015 Feb 26;518(7540):529-33. doi: 10.1038/nature14236.