Yan Mengda, Yang Rennong, Zhang Ying, Yue Longfei, Hu Dongyuan
School of Air Traffic Control and Navigation, Air Force Engineering University, Xi'an, 710051, China.
Sci Rep. 2022 Nov 7;12(1):18888. doi: 10.1038/s41598-022-21756-6.
This paper proposes a missile manoeuvring algorithm based on hierarchical proximal policy optimization (PPO) reinforcement learning, which enables a missile to guide itself to a target while evading an interceptor. Following the idea of task hierarchy, the agent has a two-layer structure in which low-level agents control basic actions and are themselves controlled by a high-level agent. The low level comprises two agents, a guidance agent and an evasion agent, which are trained in simple scenarios and then embedded under the high-level agent. The high level consists of a policy-selector agent, which chooses one of the low-level agents to activate at each decision moment. The reward function of each agent is different, accounting for guidance accuracy, flight time, and energy consumption, as well as a field-of-view constraint. Simulations show that the PPO algorithm without a hierarchical structure cannot complete the task, whereas the hierarchical PPO algorithm achieves a 100% success rate on a test dataset. The agent shows good adaptability and strong robustness to the second-order lag of the autopilot and to measurement noise. Compared with a traditional guidance law, the reinforcement learning guidance law achieves satisfactory guidance accuracy and significant advantages in average flight time and average energy consumption.
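To make the two-layer structure described in the abstract concrete, the following is a minimal Python sketch of the hierarchical decision loop: a high-level selector activates exactly one of two low-level agents at each decision moment. All names (LowLevelAgent, PolicySelector, the observation field interceptor_range) and the range-based selection rule are hypothetical illustrations, not the paper's actual networks, state vector, or trained selector policy, which the abstract does not specify.

```python
import numpy as np

class LowLevelAgent:
    """Stand-in for a pretrained low-level PPO policy (guidance or evasion).
    In the paper these are trained in simple scenarios before being embedded."""
    def __init__(self, name, rng):
        self.name = name
        self.rng = rng

    def act(self, observation):
        # Placeholder: a real PPO policy network would map the observation
        # (relative kinematics of target and interceptor) to an acceleration
        # command. Here we return a random lateral-acceleration pair.
        return self.rng.uniform(-1.0, 1.0, size=2)

class PolicySelector:
    """High-level agent: at each decision step it activates exactly one
    low-level agent, mirroring the two-layer structure in the abstract."""
    def __init__(self, low_level_agents):
        self.low_level_agents = low_level_agents

    def select(self, observation):
        # Placeholder rule for illustration only; in the paper this choice
        # is itself made by a PPO policy trained over the embedded agents.
        if observation["interceptor_range"] < 5_000.0:
            return 1  # evasion agent when the interceptor is close
        return 0      # guidance agent otherwise

rng = np.random.default_rng(0)
agents = [LowLevelAgent("guidance", rng), LowLevelAgent("evasion", rng)]
selector = PolicySelector(agents)

obs = {"interceptor_range": 3_200.0}  # metres; hypothetical observation field
idx = selector.select(obs)
command = agents[idx].act(obs)
print(agents[idx].name, command)
```

The design point this sketch illustrates is that only the selector must reason about the full guide-and-evade task; each low-level agent keeps the simpler objective (and reward function) it was trained on, which is what lets the hierarchical PPO succeed where the flat PPO reportedly fails.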