Pan Zebang, Wen Guilin, Tan Zhao, Yin Shan, Hu Xiaoyan
State Key Laboratory of Advanced Design and Manufacturing for Vehicle Body, Hunan University, Changsha, Hunan, China.
School of Mechanical Engineering, Yanshan University, Qinhuangdao, Hebei, China.
Front Neurorobot. 2022 Dec 13;16:1012427. doi: 10.3389/fnbot.2022.1012427. eCollection 2022.
Atypical Markov decision processes (MDPs) are decision-making problems in which the immediate return is maximized over a single state transition. Many complex dynamic problems can be cast as atypical MDPs, e.g., football trajectory control, approximation of compound Poincaré maps, and parameter identification. However, existing deep reinforcement learning (RL) algorithms are designed to maximize long-term returns, which wastes computing resources when they are applied to atypical MDPs. These algorithms are also limited by the estimation error of the value function, which leads to a poor policy. To overcome these limitations, this paper proposes an immediate-return algorithm for atypical MDPs with continuous action spaces by designing an unbiased, low-variance target Q-value and a simplified network framework. Two examples of atypical MDPs under uncertainty, passing a football to a moving player and chipping a football over a human wall, are then presented to illustrate the performance of the proposed algorithm. Compared with existing deep RL algorithms such as deep deterministic policy gradient and proximal policy optimization, the proposed algorithm shows significant advantages in learning efficiency, effective rate of control, and computing resource usage.
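To make the core idea concrete: because an atypical MDP ends after one state transition, the target Q-value can be the observed immediate reward itself, with no bootstrapped term and no discount factor, so the target is unbiased by value-function error. The sketch below illustrates this one-step actor-critic update under those assumptions; it is a minimal illustration in PyTorch, not the paper's implementation, and all network sizes, names, and hyperparameters are hypothetical.

```python
# Minimal sketch of an immediate-return actor-critic update for a
# one-step ("atypical") MDP. Illustrative only; dimensions, architectures,
# and learning rates are assumptions, not the authors' settings.
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, HIDDEN = 8, 2, 64  # hypothetical dimensions

# Actor: deterministic policy pi(s) -> a in [-1, 1]^ACTION_DIM.
actor = nn.Sequential(
    nn.Linear(STATE_DIM, HIDDEN), nn.ReLU(),
    nn.Linear(HIDDEN, ACTION_DIM), nn.Tanh(),
)
# Critic: Q(s, a) -> scalar estimate of the immediate return.
critic = nn.Sequential(
    nn.Linear(STATE_DIM + ACTION_DIM, HIDDEN), nn.ReLU(),
    nn.Linear(HIDDEN, 1),
)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def update(state, action, reward):
    """One gradient step from a batch of (s, a, r) transitions.

    The episode ends after a single transition, so the target Q-value
    is the observed reward itself: no bootstrapping, no discount, and
    hence no bias introduced by value-function estimation error.
    """
    # Critic regresses Q(s, a) onto the immediate reward.
    q = critic(torch.cat([state, action], dim=-1))
    critic_loss = nn.functional.mse_loss(q, reward)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor ascends the critic: maximize Q(s, pi(s)).
    actor_loss = -critic(torch.cat([state, actor(state)], dim=-1)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

# Usage with a toy batch of transitions:
s = torch.randn(32, STATE_DIM)
a = torch.randn(32, ACTION_DIM).clamp(-1, 1)
r = torch.randn(32, 1)
update(s, a, r)
```

Note the simplification relative to long-horizon methods such as DDPG: no target networks or replay of bootstrapped targets are needed, since the regression target is a fixed observed quantity rather than a moving estimate.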