Li Tianyi, Yang Genke, Chu Jian
IEEE Trans Cybern. 2024 May;54(5):3051-3064. doi: 10.1109/TCYB.2023.3254596. Epub 2024 Apr 16.
Efficient and intelligent exploration remains a major challenge in deep reinforcement learning (DRL). Bayesian inference with a distributional representation is usually an effective way to improve the exploration ability of an RL agent. However, when optimizing Bayesian neural networks (BNNs), most algorithms must specify an explicit parameter distribution, such as a multivariate Gaussian, which can reduce the flexibility of the model representation and degrade algorithm performance. Therefore, to improve sample efficiency and exploration with Bayesian methods, we propose a novel implicit posteriori parameter distribution optimization (IPPDO) algorithm. First, we adopt a distributional perspective on the network parameters and model them with an implicit distribution, approximated by generative models; each model corresponds to a learned latent space that provides structured stochasticity for each layer of the network. Next, to make the implicit posterior parameter distribution optimizable, we build an energy-based model (EBM) with a value function to represent the implicit distribution, which is not constrained by any analytic density function. Then, we design a training algorithm based on amortized Stein variational gradient descent (SVGD) to improve model learning efficiency. We compare IPPDO with other prevailing DRL algorithms on the OpenAI Gym, MuJoCo, and Box2D platforms. Experiments on various tasks demonstrate that the proposed algorithm represents parameter uncertainty implicitly for a learned policy and consistently outperforms competing approaches.
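Since the abstract's training algorithm amortizes SVGD, a minimal sketch of the underlying (non-amortized) SVGD particle update may help clarify what is being amortized. The sketch below drives particles toward a toy 2-D standard Gaussian target; the target density, the RBF kernel with a median-heuristic bandwidth, and all function names are illustrative assumptions, not the authors' implementation.

    # Minimal SVGD sketch (assumed toy setup, not the IPPDO code).
    import numpy as np

    def rbf_kernel(x):
        """RBF kernel matrix k[i, j] = k(x_i, x_j) and its gradient w.r.t. x_i."""
        diffs = x[:, None, :] - x[None, :, :]          # (n, n, d)
        sq_dists = np.sum(diffs ** 2, axis=-1)         # (n, n)
        # Coarse median heuristic for the bandwidth (includes the zero diagonal).
        h = np.median(sq_dists) / np.log(x.shape[0] + 1) + 1e-8
        k = np.exp(-sq_dists / h)
        grad_k = -2.0 / h * (k[:, :, None] * diffs)    # grad_{x_i} k(x_i, x_j)
        return k, grad_k

    def grad_log_p(x):
        """Score of the toy target: standard 2-D Gaussian, grad log p(x) = -x."""
        return -x

    def svgd_step(particles, step_size=0.1):
        k, grad_k = rbf_kernel(particles)
        n = particles.shape[0]
        # phi(x_i) = (1/n) sum_j [ k(x_j, x_i) grad log p(x_j) + grad_{x_j} k(x_j, x_i) ]
        # k is symmetric, so grad_k.sum(axis=0)[i] = sum_j grad_{x_j} k(x_j, x_i).
        phi = (k @ grad_log_p(particles) + grad_k.sum(axis=0)) / n
        return particles + step_size * phi

    rng = np.random.default_rng(0)
    particles = rng.normal(loc=5.0, scale=1.0, size=(50, 2))  # start far from target
    for _ in range(500):
        particles = svgd_step(particles)
    print("particle mean ~ 0:", particles.mean(axis=0))

The attractive term pulls particles toward high-density regions of the target, while the kernel-gradient term repels nearby particles and preserves diversity; amortizing this update means training a network to output the update directly instead of iterating it per particle.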