The Actor-Dueling-Critic Method for Reinforcement Learning

Affiliations

College of Automation, Harbin Engineering University, Harbin 150001, China.

Department of Computer Science, Aalto University, 02150 Espoo, Finland.

Publication information

Sensors (Basel). 2019 Mar 30;19(7):1547. doi: 10.3390/s19071547.

DOI:10.3390/s19071547

PMID:30935035

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC6479875/

Abstract

Model-free reinforcement learning is a powerful and efficient machine-learning paradigm that has been widely used in robotic control. In the reinforcement learning setting, value function methods learn policies by maximizing the state-action value (Q value), but they suffer from inaccurate value estimation, which leads to poor performance in stochastic environments. To mitigate this issue, we present an approach based on the actor-critic framework. In the critic branch, we modify how the Q value is estimated by introducing an advantage function, as in the dueling network, which estimates the action-advantage value. Because the action-advantage value is independent of the state and of environment noise, we use it as a fine-tuning factor for the estimated Q value. We refer to this approach as the actor-dueling-critic (ADC) network, since the framework is inspired by the dueling network. Furthermore, we redesign the dueling network part of the critic branch to adapt it to continuous action spaces. The method was tested on gym classic control environments and an obstacle avoidance environment, and we designed a noisy environment to test training stability. The results indicate that the ADC approach is more stable and converges faster than the DDPG method in noisy environments.
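The critic structure described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes the critic adds a state-value stream V(s) to an advantage stream A(a) that takes only the continuous action as input, which is one reading of the statement that the action-advantage value is independent of the state. The helper names (`init_mlp`, `mlp`, `dueling_critic`), the layer sizes, and the random weights are all hypothetical.

```python
import numpy as np


def init_mlp(rng, in_dim, hidden_dim, out_dim):
    """Create random weights for a one-hidden-layer MLP (hypothetical sizes)."""
    return {
        "w1": rng.normal(0.0, 0.1, size=(in_dim, hidden_dim)),
        "b1": np.zeros(hidden_dim),
        "w2": rng.normal(0.0, 0.1, size=(hidden_dim, out_dim)),
        "b2": np.zeros(out_dim),
    }


def mlp(params, x):
    """Forward pass: tanh hidden layer followed by a linear output layer."""
    hidden = np.tanh(x @ params["w1"] + params["b1"])
    return hidden @ params["w2"] + params["b2"]


def dueling_critic(value_params, adv_params, state, action):
    """Assumed combination: Q(s, a) = V(s) + A(a).

    The advantage stream sees only the action, so perturbing the state
    changes V(s) but leaves A(a) untouched.
    """
    value = mlp(value_params, state)        # shape (batch, 1)
    advantage = mlp(adv_params, action)     # shape (batch, 1)
    return value + advantage


rng = np.random.default_rng(0)
state_dim, action_dim = 3, 1  # illustrative sizes, not taken from the paper
value_params = init_mlp(rng, state_dim, 16, 1)
adv_params = init_mlp(rng, action_dim, 16, 1)

states = rng.normal(size=(4, state_dim))
actions_a = rng.uniform(-1.0, 1.0, size=(4, action_dim))
actions_b = rng.uniform(-1.0, 1.0, size=(4, action_dim))
noisy_states = states + rng.normal(0.0, 0.5, size=states.shape)

# The Q-value gap between two actions does not depend on state noise.
gap_clean = (dueling_critic(value_params, adv_params, states, actions_a)
             - dueling_critic(value_params, adv_params, states, actions_b))
gap_noisy = (dueling_critic(value_params, adv_params, noisy_states, actions_a)
             - dueling_critic(value_params, adv_params, noisy_states, actions_b))
print(np.allclose(gap_clean, gap_noisy))
```

Under this assumed decomposition, observation noise perturbs only the V(s) term, so the relative ranking of actions at a given state is preserved. In a full actor-critic loop, the actor would be trained against this critic in the same way as in DDPG.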
