Safe reinforcement learning under temporal logic with reward design and quantum action selection.

Affiliations

Department of Mechanical Engineering, Lehigh University, 113 Research Drive, Bethlehem, PA, 18015, USA.

Department of Mechanical Engineering, University of Iowa, 3131 Seamans Center, Iowa City, IA, 52242, USA.

Publication Information

Sci Rep. 2023 Feb 2;13(1):1925. doi: 10.1038/s41598-023-28582-4.

Abstract

This paper proposes an advanced Reinforcement Learning (RL) method incorporating reward shaping, safety value functions, and a quantum action selection algorithm. The method is model-free and can synthesize a finite policy that maximizes the probability of satisfying a complex task. Although RL is a promising approach, it suffers from unsafe traps and sparse rewards and becomes impractical when applied to real-world problems. To improve safety during training, we introduce a concept of safety values, which results in a model-based adaptive scenario due to online updates of the transition probabilities. On the other hand, a high-level complex task is usually formulated via formal languages, including Linear Temporal Logic (LTL). Another novelty of this work is the use of an Embedded Limit-Deterministic Generalized Büchi Automaton (E-LDGBA) to represent an LTL formula. The obtained deterministic policy generalizes across tasks over infinite and finite horizons. We design an automaton-based reward, and the theoretical analysis shows that an agent following the optimal policy accomplishes the task specification with the maximum probability. Furthermore, a reward shaping process is developed to avoid sparse rewards and enforce RL convergence while keeping the optimal policies invariant. In addition, inspired by quantum computing, we propose a quantum action selection algorithm to replace the existing ε-greedy algorithm for balancing exploration and exploitation during learning. Simulations demonstrate that the proposed framework achieves good performance, dramatically reducing the number of visits to unsafe states while converging to optimal policies.
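The abstract's claim that reward shaping can densify sparse automaton-based rewards while leaving the optimal policies invariant is the defining property of potential-based shaping. The snippet below is only a minimal sketch of that general form, not the paper's own construction; the `potential` array, its values, and the toy state indexing are assumptions (e.g., a potential reflecting progress toward accepting sets of the E-LDGBA).

```python
import numpy as np

def shaped_reward(r, s, s_next, potential, gamma=0.99):
    """Potential-based reward shaping: F(s, s') = gamma * Phi(s') - Phi(s).

    Adding a shaping term of this form to the environment reward is known
    (Ng, Harada & Russell, 1999) to leave the set of optimal policies
    unchanged, which matches the invariance property stated in the abstract.
    """
    return r + gamma * potential[s_next] - potential[s]

# Hypothetical potential that grows as the product-automaton state moves
# closer to an accepting set, densifying an otherwise sparse reward signal.
potential = np.array([0.0, 0.2, 0.5, 1.0])   # one entry per toy state
print(shaped_reward(r=0.0, s=1, s_next=2, potential=potential))
```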

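The abstract does not detail the quantum action selection procedure, so the following is a hedged classical sketch of the general quantum-inspired idea it alludes to: each state keeps an amplitude vector over actions, actions are sampled by the Born rule, and the greedy action's amplitude is amplified Grover-style instead of following an ε-greedy schedule. The class name `QuantumActionSelector`, the `boost` parameter, and the surrounding Q-learning pseudo-loop are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

class QuantumActionSelector:
    """Classical simulation of quantum-inspired action selection.

    Each state holds a unit-norm amplitude vector over actions. Actions are
    sampled with probability equal to the squared amplitude (Born rule), and
    the amplitude of the currently greedy action is amplified after each
    learning step, so exploration decays smoothly into exploitation without
    an explicit epsilon schedule.
    """

    def __init__(self, n_states, n_actions, boost=0.05, seed=None):
        # Uniform superposition: every action starts equally likely.
        self.amps = np.full((n_states, n_actions), 1.0 / np.sqrt(n_actions))
        self.boost = boost
        self.rng = np.random.default_rng(seed)

    def select(self, state):
        probs = self.amps[state] ** 2
        probs /= probs.sum()                      # guard against rounding drift
        return self.rng.choice(len(probs), p=probs)

    def reinforce(self, state, greedy_action):
        # Grover-style amplification of the greedy action, then renormalize
        # so the amplitudes remain a valid unit-norm vector.
        self.amps[state, greedy_action] += self.boost
        self.amps[state] /= np.linalg.norm(self.amps[state])

# Sketch of its place in a tabular Q-learning loop (hypothetical env API):
#   a = selector.select(s)
#   s_next, r = env.step(a)
#   Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
#   selector.reinforce(s, Q[s].argmax())
```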

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/329d/9894922/60cce81a4489/41598_2023_28582_Fig1_HTML.jpg
