Chen Weimin, Wong Kelvin Kian Loong, Long Sifan, Sun Zhili
School of Information and Electronics, Hunan City University, Yiyang 413000, China.
School of Computer Science and Engineering, Central South University, Changsha 410075, China.
Entropy (Basel). 2022 Mar 22;24(4):440. doi: 10.3390/e24040440.
In the field of reinforcement learning, we propose a Correct Proximal Policy Optimization (CPPO) algorithm based on a modified penalty factor and relative entropy, in order to address the robustness and stationarity problems of traditional algorithms. Firstly, this paper establishes a policy evaluation mechanism through the policy distribution function during the reinforcement learning process. Secondly, the state space function is quantified by introducing entropy, whereby an approximation policy is used to approximate the real policy distribution, and kernel-function estimation and calculation of the relative entropy are used to fit the reward function for complex problems. Finally, through a comparative analysis on classic test cases, we demonstrate that the proposed algorithm is effective, converges faster, and performs better than the traditional PPO algorithm, and that the relative entropy measure can reveal the differences between policies. In addition, the algorithm can use the information of a complex environment more efficiently to learn policies. At the same time, this paper not only explains the rationality of the policy distribution theory; the proposed framework also balances iteration steps, computational complexity, and convergence speed, and we introduce an effective measure of performance based on the concept of relative entropy.
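
The following is a minimal sketch, not the authors' implementation, of the KL-penalty PPO surrogate that CPPO builds on: the penalty factor beta is adapted from the measured relative entropy (KL divergence) between the old and new policy distributions. The names kl_target and the doubling/halving adaptation rule are standard PPO-penalty heuristics assumed here for illustration, not details taken from the paper.

import numpy as np

def kl_divergence(p_old, p_new, eps=1e-12):
    """Relative entropy KL(p_old || p_new) for discrete action distributions."""
    p_old = np.clip(p_old, eps, 1.0)
    p_new = np.clip(p_new, eps, 1.0)
    return np.sum(p_old * np.log(p_old / p_new), axis=-1)

def penalized_surrogate(logp_new, logp_old, advantages, p_old, p_new, beta):
    """PPO-penalty objective: ratio-weighted advantage minus beta * KL."""
    ratio = np.exp(logp_new - logp_old)      # importance ratio pi_new / pi_old
    kl = kl_divergence(p_old, p_new)         # per-state relative entropy
    return np.mean(ratio * advantages - beta * kl), np.mean(kl)

def update_penalty(beta, mean_kl, kl_target=0.01):
    """Adapt the penalty factor from the observed relative entropy (assumed rule)."""
    if mean_kl > 1.5 * kl_target:
        beta *= 2.0      # policy moved too far: penalize divergence more strongly
    elif mean_kl < kl_target / 1.5:
        beta /= 2.0      # policy barely moved: relax the penalty
    return beta

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_states, n_actions = 64, 4
    p_old = rng.dirichlet(np.ones(n_actions), size=n_states)   # old policy pi_old(a|s)
    p_new = rng.dirichlet(np.ones(n_actions), size=n_states)   # candidate policy pi_theta(a|s)
    actions = rng.integers(n_actions, size=n_states)
    logp_old = np.log(p_old[np.arange(n_states), actions])
    logp_new = np.log(p_new[np.arange(n_states), actions])
    advantages = rng.normal(size=n_states)
    beta = 1.0
    objective, mean_kl = penalized_surrogate(logp_new, logp_old, advantages, p_old, p_new, beta)
    beta = update_penalty(beta, mean_kl)
    print(f"surrogate objective = {objective:.4f}, mean KL = {mean_kl:.4f}, new beta = {beta:.2f}")

In this sketch the relative entropy plays the same double role described in the abstract: it penalizes the policy update inside the surrogate objective and simultaneously serves as a measurable quantity for comparing policy distributions across iterations.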