

Safe Reinforcement Learning With Dual Robustness.

Authors

Li Zeyang, Hu Chuxiong, Wang Yunan, Yang Yujie, Li Shengbo Eben

Publication

IEEE Trans Pattern Anal Mach Intell. 2024 Dec;46(12):10876-10890. doi: 10.1109/TPAMI.2024.3443916. Epub 2024 Nov 6.

DOI: 10.1109/TPAMI.2024.3443916
PMID: 39146157
Abstract

Reinforcement learning (RL) agents are vulnerable to adversarial disturbances, which can deteriorate task performance or break safety specifications. Existing methods either address safety requirements under the assumption of no adversary (e.g., safe RL) or only focus on robustness against performance adversaries (e.g., robust RL). Learning one policy that is both safe and robust under any adversary remains a challenging open problem. The difficulty lies in tackling two intertwined aspects in the worst case: feasibility and optimality. Optimality is only valid inside a feasible region (i.e., a robust invariant set), while identification of the maximal feasible region relies in turn on the optimal policy. To address this issue, we propose a systematic framework that unifies safe RL and robust RL, including the problem formulation, iteration scheme, convergence analysis, and practical algorithm design. The unification is built upon constrained two-player zero-sum Markov games, in which the objective for the protagonist is twofold. For states inside the maximal robust invariant set, the goal is to pursue rewards under the condition of guaranteed safety; for states outside it, the goal is to reduce the extent of constraint violation. A dual policy iteration scheme is proposed, which simultaneously optimizes a task policy and a safety policy. We prove that the iteration scheme converges to the optimal task policy, which maximizes the twofold objective in the worst case, and to the optimal safety policy, which stays as far away from the safety boundary as possible. The convergence of the safety policy is established by exploiting the monotone contraction property of safety self-consistency operators, and that of the task policy depends on the transformation of safety constraints into state-dependent action spaces. By adding two adversarial networks (one for the safety guarantee and the other for task performance), we propose a practical deep RL algorithm for constrained zero-sum Markov games, called dually robust actor-critic (DRAC). Evaluations on safety-critical benchmarks demonstrate that DRAC achieves high performance and persistent safety under all scenarios (no adversary, safety adversary, performance adversary), outperforming all baselines by a large margin.
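The structure described in the abstract can be illustrated with a small tabular sketch: a toy constrained two-player zero-sum Markov game on a line of states, where a safety value is computed by iterating a safety self-consistency operator, feasibility induces state-dependent safe action sets, and a robust task value is then optimized over those sets. Everything concrete below (the environment, action sets, discount, and threshold) is an assumption invented for illustration; it is not the authors' DRAC algorithm, which uses deep actor-critic networks with learned adversaries.

```python
# Illustrative sketch of the dual iteration structure on a toy
# constrained zero-sum Markov game (all parameters are assumptions).
GAMMA = 0.9
N_STATES = 5              # states 0..4 on a line; state 0 is the hazard
GOAL = 4                  # arriving at or staying in state 4 yields reward 1
P_ACTIONS = [-1, 1, 2]    # protagonist moves
D_ACTIONS = [-1, 0]       # adversary pushes
HAZARD = [1.0, 0.0, 0.0, 0.0, 0.0]  # h(s): 1 inside the unsafe set
TAU = 0.5                 # feasibility threshold on the safety value

def step(s, a, d):
    # Line world: both players' moves sum, clipped at the ends.
    return min(max(s + a + d, 0), N_STATES - 1)

def safety_values(iters=200):
    # Safety self-consistency operator:
    #   F(s) = max(h(s), gamma * min_a max_d F(s'))
    # The protagonist minimizes worst-case future hazard; the operator
    # is a monotone contraction, so plain fixed-point iteration converges.
    F = HAZARD[:]
    for _ in range(iters):
        F = [max(HAZARD[s],
                 GAMMA * min(max(F[step(s, a, d)] for d in D_ACTIONS)
                             for a in P_ACTIONS))
             for s in range(N_STATES)]
    return F

def safe_actions(F, s):
    # State-dependent action space: keep actions whose worst-case
    # successor stays below the safety threshold.
    return [a for a in P_ACTIONS
            if max(F[step(s, a, d)] for d in D_ACTIONS) <= TAU]

def task_values(F, iters=200):
    # Robust value iteration restricted to the safe action sets:
    #   V(s) = max_{a in A_safe(s)} min_d [r(s') + gamma * V(s')]
    # States with an empty safe set are left at 0 here; the paper
    # instead minimizes the extent of constraint violation there.
    V = [0.0] * N_STATES
    for _ in range(iters):
        newV = []
        for s in range(N_STATES):
            acts = safe_actions(F, s)
            if not acts:
                newV.append(0.0)
                continue
            newV.append(max(min((1.0 if step(s, a, d) == GOAL else 0.0)
                                + GAMMA * V[step(s, a, d)]
                                for d in D_ACTIONS)
                            for a in acts))
        V = newV
    return V

F = safety_values()
V = task_values(F)
```

In this toy game, state 0 already violates the constraint (F(0) = 1), so the maximal robust invariant set is {1, 2, 3, 4}: from each of those states the protagonist has an action whose worst-case successor stays hazard-free, and the robust task value is optimized only over those safe actions.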


Similar Articles

1. Safe Reinforcement Learning With Dual Robustness.
   IEEE Trans Pattern Anal Mach Intell. 2024 Dec;46(12):10876-10890. doi: 10.1109/TPAMI.2024.3443916. Epub 2024 Nov 6.
2. Learn Zero-Constraint-Violation Safe Policy in Model-Free Constrained Reinforcement Learning.
   IEEE Trans Neural Netw Learn Syst. 2025 Feb;36(2):2327-2341. doi: 10.1109/TNNLS.2023.3348422. Epub 2025 Feb 6.
3. Kernel-based least squares policy iteration for reinforcement learning.
   IEEE Trans Neural Netw. 2007 Jul;18(4):973-92. doi: 10.1109/TNN.2007.899161.
4. Data-Based Optimal Consensus Control for Multiagent Systems With Policy Gradient Reinforcement Learning.
   IEEE Trans Neural Netw Learn Syst. 2022 Aug;33(8):3872-3883. doi: 10.1109/TNNLS.2021.3054685. Epub 2022 Aug 3.
5. A Maximum Divergence Approach to Optimal Policy in Deep Reinforcement Learning.
   IEEE Trans Cybern. 2023 Mar;53(3):1499-1510. doi: 10.1109/TCYB.2021.3104612. Epub 2023 Feb 15.
6. CVaR-Constrained Policy Optimization for Safe Reinforcement Learning.
   IEEE Trans Neural Netw Learn Syst. 2025 Jan;36(1):830-841. doi: 10.1109/TNNLS.2023.3331304. Epub 2025 Jan 7.
7. RAP Vol: Robust Adversary Populations With Volume Diversity Measure.
   IEEE Trans Neural Netw Learn Syst. 2024 Dec;35(12):18485-18498. doi: 10.1109/TNNLS.2023.3317145. Epub 2024 Dec 2.
8. Towards Robust Decision-Making for Autonomous Highway Driving Based on Safe Reinforcement Learning.
   Sensors (Basel). 2024 Jun 26;24(13):4140. doi: 10.3390/s24134140.
9. Optimal Control for Constrained Discrete-Time Nonlinear Systems Based on Safe Reinforcement Learning.
   IEEE Trans Neural Netw Learn Syst. 2025 Jan;36(1):854-865. doi: 10.1109/TNNLS.2023.3326397. Epub 2025 Jan 7.
10. Optimal Policy of Multiplayer Poker via Actor-Critic Reinforcement Learning.
    Entropy (Basel). 2022 May 30;24(6):774. doi: 10.3390/e24060774.