Pan Zebang, Wen Guilin, Tan Zhao, Yin Shan, Hu Xiaoyan
State Key Laboratory of Advanced Design and Manufacturing for Vehicle Body, Hunan University, Changsha, Hunan, China.
School of Mechanical Engineering, Yanshan University, Qinhuangdao, Hebei, China.
Front Neurorobot. 2022 Dec 13;16:1012427. doi: 10.3389/fnbot.2022.1012427. eCollection 2022.
Atypical Markov decision processes (MDPs) are decision-making problems in which the immediate return is maximized over a single state transition. Many complex dynamic problems can be cast as atypical MDPs, e.g., football trajectory control, approximation of compound Poincaré maps, and parameter identification. However, existing deep reinforcement learning (RL) algorithms are designed to maximize long-term returns, which wastes computing resources when they are applied to atypical MDPs. These algorithms are also limited by the estimation error of the value function, which leads to a poor policy. To overcome these limitations, this paper proposes an immediate-return algorithm for atypical MDPs with continuous action spaces by designing an unbiased, low-variance target Q-value and a simplified network framework. Two examples of atypical MDPs under uncertainty, passing a football to a moving player and chipping a football over a human wall, are then presented to illustrate the performance of the proposed algorithm. Compared with existing deep RL algorithms such as deep deterministic policy gradient and proximal policy optimization, the proposed algorithm shows significant advantages in learning efficiency, effective rate of control, and computing resource usage.
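To make the core idea concrete: because an atypical MDP ends after one state transition, the target Q-value can be the observed immediate reward itself, with no bootstrapped term and no discount factor, so the target is unbiased by value-function error. The sketch below illustrates this one-step actor-critic update under those assumptions; it is a minimal illustration in PyTorch, not the paper's implementation, and all network sizes, names, and hyperparameters are hypothetical.

```python
# Minimal sketch of an immediate-return actor-critic update for a
# one-step ("atypical") MDP. Illustrative only; dimensions, architectures,
# and learning rates are assumptions, not the authors' settings.
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, HIDDEN = 8, 2, 64  # hypothetical dimensions

# Actor: deterministic policy pi(s) -> a in [-1, 1]^ACTION_DIM.
actor = nn.Sequential(
    nn.Linear(STATE_DIM, HIDDEN), nn.ReLU(),
    nn.Linear(HIDDEN, ACTION_DIM), nn.Tanh(),
)
# Critic: Q(s, a) -> scalar estimate of the immediate return.
critic = nn.Sequential(
    nn.Linear(STATE_DIM + ACTION_DIM, HIDDEN), nn.ReLU(),
    nn.Linear(HIDDEN, 1),
)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def update(state, action, reward):
    """One gradient step from a batch of (s, a, r) transitions.

    The episode ends after a single transition, so the target Q-value
    is the observed reward itself: no bootstrapping, no discount, and
    hence no bias introduced by value-function estimation error.
    """
    # Critic regresses Q(s, a) onto the immediate reward.
    q = critic(torch.cat([state, action], dim=-1))
    critic_loss = nn.functional.mse_loss(q, reward)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor ascends the critic: maximize Q(s, pi(s)).
    actor_loss = -critic(torch.cat([state, actor(state)], dim=-1)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

# Usage with a toy batch of transitions:
s = torch.randn(32, STATE_DIM)
a = torch.randn(32, ACTION_DIM).clamp(-1, 1)
r = torch.randn(32, 1)
update(s, a, r)
```

Note the simplification relative to long-horizon methods such as DDPG: no target networks or replay of bootstrapped targets are needed, since the regression target is a fixed observed quantity rather than a moving estimate.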