A hybrid machine learning-based model for predicting flight delay through aviation big data.

Author information

Dai Min

Affiliation

CAAC Academy, Civil Aviation Flight University of China, Guanghan, 618307, China.

Publication information

Sci Rep. 2024 Feb 26;14(1):4603. doi: 10.1038/s41598-024-55217-z.

DOI:10.1038/s41598-024-55217-z

PMID:38409455

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC10897135/

Abstract

Predicting flight delays is an important and challenging problem in flight scheduling and planning for airports and airlines, and in recent years a variety of machine learning methods have been applied to it. This article proposes a new method for this problem. The method introduces a group of potential indicators related to flight delay and combines ANOVA with the Forward Sequential Feature Selection (FSFS) algorithm to identify the indicators that most influence delays. To cope with large volumes of flight data, it employs a clustering strategy based on the DBSCAN algorithm: samples are clustered into groups of similar samples, and a separate learning model predicts flight delays for each group. This decomposes the problem into smaller sub-problems and improves the prediction system's accuracy (by 2.49%) and processing speed (by 39.17%). The learning model used in each cluster is a novel structure based on a random forest in which each tree component is optimized and weighted using the Coyote Optimization Algorithm (COA). Optimizing the structure of each tree and assigning it a weight yields an accuracy increase of at least 5.3% over the conventional random forest model. The performance of the proposed method is tested and compared with previous research. The findings show that it achieves an average accuracy of 97.2%, a 4.7% improvement over previous efforts.
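
The abstract does not say how the COA-derived tree weights enter the final prediction. As a minimal sketch, assuming the weights are combined through a weighted majority vote over per-tree class predictions, the idea might look like the following. The function name `weighted_vote`, the class labels, and the weight values are hypothetical and chosen only for illustration; the COA search that would produce the weights is not shown.

```python
from collections import defaultdict
from typing import Hashable, Sequence


def weighted_vote(predictions: Sequence[Hashable], weights: Sequence[float]) -> Hashable:
    """Return the label with the largest total weight across trees.

    Assumption: each tree casts one vote for a class label, scaled by its weight.
    """
    if len(predictions) != len(weights):
        raise ValueError("predictions and weights must have the same length")
    if not predictions:
        raise ValueError("at least one tree prediction is required")
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    # On a tie, the label that was seen first wins (dict insertion order).
    return max(totals, key=totals.get)


# Hypothetical example: three trees vote on a single flight.
tree_predictions = ["delayed", "on_time", "on_time"]
tree_weights = [0.7, 0.2, 0.3]  # illustrative values, not taken from the paper

plain = weighted_vote(tree_predictions, [1.0, 1.0, 1.0])  # equal weights
weighted = weighted_vote(tree_predictions, tree_weights)  # unequal weights
```

With equal weights this reduces to the ordinary majority vote of a conventional random forest. With unequal weights a single highly weighted tree can outvote two lightly weighted ones, which is one plausible way that weighting could change the ensemble's decisions.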
