Li Tianyi, Yang Genke, Chu Jian
IEEE Trans Cybern. 2024 May;54(5):3051-3064. doi: 10.1109/TCYB.2023.3254596. Epub 2024 Apr 16.
Efficient and intelligent exploration remains a major challenge in deep reinforcement learning (DRL). Bayesian inference with a distributional representation is usually an effective way to improve the exploration ability of an RL agent. However, when optimizing Bayesian neural networks (BNNs), most algorithms must specify an explicit parameter distribution, such as a multivariate Gaussian, which can reduce the flexibility of the model representation and degrade algorithm performance. Therefore, to improve sample efficiency and exploration with Bayesian methods, we propose a novel implicit posteriori parameter distribution optimization (IPPDO) algorithm. First, we adopt a distributional perspective on the network parameters and model them with an implicit distribution, approximated by generative models; each model corresponds to a learned latent space that provides structured stochasticity for each layer of the network. Next, to make the implicit posterior parameter distribution optimizable, we build an energy-based model (EBM) with a value function to represent the implicit distribution, which is not constrained by any analytic density function. Then, we design a training algorithm based on amortized Stein variational gradient descent (SVGD) to improve model learning efficiency. We compare IPPDO with other prevailing DRL algorithms on the OpenAI Gym, MuJoCo, and Box2D platforms. Experiments on various tasks demonstrate that the proposed algorithm represents parameter uncertainty implicitly for a learned policy and consistently outperforms competing approaches.
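Since the abstract's training algorithm amortizes SVGD, a minimal sketch of the underlying (non-amortized) SVGD particle update may help clarify what is being amortized. The sketch below drives particles toward a toy 2-D standard Gaussian target; the target density, the RBF kernel with a median-heuristic bandwidth, and all function names are illustrative assumptions, not the authors' implementation.

    # Minimal SVGD sketch (assumed toy setup, not the IPPDO code).
    import numpy as np

    def rbf_kernel(x):
        """RBF kernel matrix k[i, j] = k(x_i, x_j) and its gradient w.r.t. x_i."""
        diffs = x[:, None, :] - x[None, :, :]          # (n, n, d)
        sq_dists = np.sum(diffs ** 2, axis=-1)         # (n, n)
        # Coarse median heuristic for the bandwidth (includes the zero diagonal).
        h = np.median(sq_dists) / np.log(x.shape[0] + 1) + 1e-8
        k = np.exp(-sq_dists / h)
        grad_k = -2.0 / h * (k[:, :, None] * diffs)    # grad_{x_i} k(x_i, x_j)
        return k, grad_k

    def grad_log_p(x):
        """Score of the toy target: standard 2-D Gaussian, grad log p(x) = -x."""
        return -x

    def svgd_step(particles, step_size=0.1):
        k, grad_k = rbf_kernel(particles)
        n = particles.shape[0]
        # phi(x_i) = (1/n) sum_j [ k(x_j, x_i) grad log p(x_j) + grad_{x_j} k(x_j, x_i) ]
        # k is symmetric, so grad_k.sum(axis=0)[i] = sum_j grad_{x_j} k(x_j, x_i).
        phi = (k @ grad_log_p(particles) + grad_k.sum(axis=0)) / n
        return particles + step_size * phi

    rng = np.random.default_rng(0)
    particles = rng.normal(loc=5.0, scale=1.0, size=(50, 2))  # start far from target
    for _ in range(500):
        particles = svgd_step(particles)
    print("particle mean ~ 0:", particles.mean(axis=0))

The attractive term pulls particles toward high-density regions of the target, while the kernel-gradient term repels nearby particles and preserves diversity; amortizing this update means training a network to output the update directly instead of iterating it per particle.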