Li Zeyang, Hu Chuxiong, Wang Yunan, Yang Yujie, Li Shengbo Eben
IEEE Trans Pattern Anal Mach Intell. 2024 Dec;46(12):10876-10890. doi: 10.1109/TPAMI.2024.3443916. Epub 2024 Nov 6.
Reinforcement learning (RL) agents are vulnerable to adversarial disturbances, which can degrade task performance or violate safety specifications. Existing methods either address safety requirements under the assumption of no adversary (e.g., safe RL) or focus only on robustness against performance adversaries (e.g., robust RL). Learning a single policy that is both safe and robust under arbitrary adversaries remains a challenging open problem. The difficulty lies in tackling two intertwined aspects under worst-case disturbances: feasibility and optimality. Optimality is only meaningful inside the feasible region (i.e., the robust invariant set), while identifying the maximal feasible region in turn depends on learning the optimal policy. To address this issue, we propose a systematic framework that unifies safe RL and robust RL, covering the problem formulation, iteration scheme, convergence analysis, and practical algorithm design. The unification is built upon constrained two-player zero-sum Markov games, in which the protagonist's objective is twofold. For states inside the maximal robust invariant set, the goal is to pursue rewards while guaranteeing safety; for states outside the maximal robust invariant set, the goal is to reduce the extent of constraint violation. We propose a dual policy iteration scheme that simultaneously optimizes a task policy and a safety policy. We prove that the iteration scheme converges to the optimal task policy, which maximizes the twofold objective in the worst case, and the optimal safety policy, which stays as far from the safety boundary as possible. The convergence of the safety policy is established by exploiting the monotone contraction property of the safety self-consistency operator, and that of the task policy relies on transforming safety constraints into state-dependent action spaces. By adding two adversarial networks (one for safety guarantee and the other for task performance), we propose a practical deep RL algorithm for constrained zero-sum Markov games, called dually robust actor-critic (DRAC). Evaluations on safety-critical benchmarks demonstrate that DRAC achieves high performance and persistent safety under all scenarios (no adversary, safety adversary, performance adversary), outperforming all baselines by a large margin.
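As a rough illustration of the formulation described above (the notation here is our own assumption, not necessarily the paper's), the constrained two-player zero-sum Markov game assigns the protagonist a twofold objective that switches on membership in the maximal robust invariant set S*: inside S*, the task policy maximizes worst-case return subject to the safety constraint; outside S*, the safety policy minimizes the worst-case extent of constraint violation.

    % Inside the maximal robust invariant set S^*: pursue reward under guaranteed safety
    \max_{\pi}\,\min_{\mu}\;
      \mathbb{E}_{\pi,\mu}\!\Big[\textstyle\sum_{t=0}^{\infty}\gamma^{t}\, r(s_t,a_t,u_t)\Big]
      \quad \text{s.t.}\quad h(s_t)\le 0 \ \ \forall t, \qquad s_0\in S^{*}

    % Outside S^*: reduce the worst-case extent of constraint violation
    \min_{\pi_s}\,\max_{\mu_s}\;
      \mathbb{E}_{\pi_s,\mu_s}\!\Big[\max_{t\ge 0}\, h(s_t)\Big],
      \qquad s_0\notin S^{*}

Here a_t is the protagonist's action, u_t the adversary's disturbance, h(s) <= 0 the state constraint, pi the task policy, and pi_s the safety policy; the paper's exact value functions and operators may differ.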
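Structurally, the dual-policy, dual-adversary design can be sketched as below. This is a minimal illustration under assumed names and a torch-style setup (DRACSketch, perf_adversary, safety_adversary are hypothetical), not the paper's actual DRAC implementation; update rules for the critics and adversaries are omitted.

    # Minimal structural sketch of a dual-policy, dual-adversary actor-critic.
    # Names and architecture are illustrative assumptions, not the paper's code.
    import torch
    import torch.nn as nn

    def mlp(in_dim, out_dim, hidden=64):
        # Small two-layer network used for every actor, adversary, and critic.
        return nn.Sequential(nn.Linear(in_dim, hidden), nn.Tanh(),
                             nn.Linear(hidden, out_dim))

    class DRACSketch:
        def __init__(self, obs_dim, act_dim, dist_dim):
            # Protagonist: task policy (pursues reward) and safety policy
            # (reduces constraint violation outside the robust invariant set).
            self.task_actor = mlp(obs_dim, act_dim)
            self.safety_actor = mlp(obs_dim, act_dim)
            # Two adversarial networks: one attacks task performance,
            # the other attacks safety.
            self.perf_adversary = mlp(obs_dim, dist_dim)
            self.safety_adversary = mlp(obs_dim, dist_dim)
            # Critics: task value and a safety value estimating the
            # worst-case constraint measure that defines feasibility.
            self.task_critic = mlp(obs_dim + act_dim + dist_dim, 1)
            self.safety_critic = mlp(obs_dim + act_dim + dist_dim, 1)

        def act(self, obs, feasible):
            # Twofold objective: inside the (estimated) maximal robust
            # invariant set, pursue reward under guaranteed safety;
            # outside it, hand control to the safety policy.
            actor = self.task_actor if feasible else self.safety_actor
            return actor(obs)

    if __name__ == "__main__":
        agent = DRACSketch(obs_dim=4, act_dim=2, dist_dim=2)
        print(agent.act(torch.zeros(4), feasible=True))

In a full training loop, the safety critic and safety adversary would be updated against each other to identify the maximal robust invariant set, while the task actor is optimized against the performance adversary within the resulting state-dependent feasible action space.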