Zhao Rui, Fan Yuze, Li Yun, Wang Kui, Gao Fei, Gao Zhenhai
College of Automotive Engineering, Jilin University, Changchun 130025, China.
Graduate School of Information and Science Technology, The University of Tokyo, Tokyo 113-8654, Japan.
Sensors (Basel). 2024 Aug 11;24(16):5187. doi: 10.3390/s24165187.
The centralized coordination of Connected and Automated Vehicles (CAVs) at unsignalized intersections aims to enhance traffic efficiency, driving safety, and passenger comfort. Autonomous Intersection Management (AIM) systems introduce a novel approach for centralized coordination. However, existing rule-based and optimization-based methods often face poor generalization and low computational efficiency when dealing with complex traffic environments and highly dynamic traffic conditions. Additionally, current Reinforcement Learning (RL)-based methods encounter difficulties with policy inference and safety. To address these issues, this study proposes the Constraint-Guided Behavior Transformer for Safe Reinforcement Learning (CoBT-SRL), which uses a transformer as the policy network to achieve efficient decision-making for vehicle driving behaviors. The method leverages the transformer's ability to capture long-range dependencies, improving sample efficiency by conditioning on historical states, actions, and reward and cost returns to predict future actions. Furthermore, to enhance exploration, a sequence-level entropy regularizer is introduced to encourage policy exploration while ensuring the safety of policy updates. Simulation results indicate that CoBT-SRL trains stably and converges effectively, outperforming other RL methods and vehicle intersection coordination schemes (VICS) based on optimal control in terms of traffic efficiency, driving safety, and passenger comfort.
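To make the return-conditioned policy and the sequence-level entropy regularizer concrete, below is a minimal NumPy sketch. It is an illustration, not the authors' implementation: the interleaving of (reward return-to-go, cost return-to-go, state, action) tokens follows the Decision-Transformer-style input the abstract describes, and `sequence_entropy` is one plausible reading of a "sequence-level entropy regularizer" (mean per-step entropy of the predicted action distributions). All function names and the token layout are assumptions for exposition.

```python
import numpy as np

def build_token_sequence(states, actions, reward_rtg, cost_rtg):
    """Interleave (reward return-to-go, cost return-to-go, state, action)
    tokens per timestep. A causal transformer consuming this sequence can
    predict each action conditioned on history and on both return targets.
    Token layout is a hypothetical choice for illustration."""
    tokens = []
    for t in range(len(states)):
        tokens.append(np.atleast_1d(reward_rtg[t]))  # desired reward-to-go
        tokens.append(np.atleast_1d(cost_rtg[t]))    # allowed cost-to-go (safety budget)
        tokens.append(np.asarray(states[t]))         # observed state
        tokens.append(np.asarray(actions[t]))        # executed action
    return tokens

def softmax(x):
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def sequence_entropy(action_logits):
    """Sequence-level entropy regularizer: average entropy of the policy's
    action distribution over all timesteps in the sequence. Adding this term
    (with a positive coefficient) to the objective encourages exploration."""
    p = softmax(action_logits)                       # (T, n_actions)
    per_step = -(p * np.log(p + 1e-12)).sum(axis=-1)
    return per_step.mean()
```

For example, uniform logits over three discrete driving behaviors give the maximum entropy `log(3) ≈ 1.0986`; as the policy becomes deterministic, the regularizer shrinks toward zero, so its coefficient trades off exploration against exploitation.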