Golestani Ali, Rezaei Nazila, Malekpour Mohammad-Reza, Ahmadi Naser, Ataei Seyed Mohammad-Navid, Khosravi Sepehr, Jafari Ayyoob, Shahraz Saeid, Farzadfar Farshad
Non-Communicable Diseases Research Center, Endocrinology and Metabolism Population Sciences Institute, Tehran University of Medical Sciences, Tehran, Iran.
Endocrinology and Metabolism Research Center, Endocrinology and Metabolism Clinical Sciences Institute, Tehran University of Medical Sciences, Tehran, Iran.
PLoS One. 2025 Jul 8;20(7):e0326483. doi: 10.1371/journal.pone.0326483. eCollection 2025.
Road traffic accidents (RTAs) are a major public health concern with significant health and economic burdens. Identifying high-risk areas and key contributing factors is essential for developing targeted interventions. While machine learning (ML) has been increasingly used to predict RTAs, the lack of interpretability limits its applicability in policymaking. This study aimed to utilize interpretable ML models to predict the occurrence of errors in road accident hotspots using telematics data in Iran and interpret the most influential predictors.
We used telematics data collected from 1673 intercity buses throughout 2020, spanning cities across all provinces of Iran. Merging these data with a weather dataset yielded a comprehensive dataset containing location, time, weather, and error type variables. After preprocessing, 619,988 records with no missing values were used to train and compare six machine learning models: logistic regression, K-nearest neighbors, random forest, Extreme Gradient Boosting (XGBoost), Naïve Bayes, and support vector machine. Because the outcome was highly imbalanced, an ensemble approach was applied to train all models. The best-performing model was then interpreted using SHAP (SHapley Additive exPlanations).
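The abstract does not say which ensemble scheme was used to handle the class imbalance. As a minimal sketch, assuming an undersampling ensemble (one common choice, not necessarily the authors' method), the majority class can be split into chunks, each chunk paired with all minority records, one model trained per balanced subset, and the members' predicted probabilities averaged. The function names `balanced_subsets` and `ensemble_probability` are hypothetical and introduced only for illustration.

```python
import random


def balanced_subsets(labels, n_subsets, seed=0):
    """Return n_subsets index lists; each holds every minority-class index
    plus a disjoint chunk of the shuffled majority-class indices.

    Assumption: binary labels coded 0/1. This is an illustrative
    undersampling scheme, not the procedure reported in the paper.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    majority = majority[:]          # copy so the shuffle leaves the input order alone
    rng.shuffle(majority)
    # Striding by n_subsets yields disjoint chunks of near-equal size.
    chunks = [majority[k::n_subsets] for k in range(n_subsets)]
    return [sorted(minority + chunk) for chunk in chunks]


def ensemble_probability(member_probs):
    """Average the probabilities predicted for one record by all members."""
    return sum(member_probs) / len(member_probs)
```

Each index list would be used to fit one copy of a given classifier (for example XGBoost), and `ensemble_probability` would combine the members' outputs for a record at prediction time.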
XGBoost performed best, with an area under the curve (AUC) of 91.70% (95% uncertainty interval: 91.33% to 92.09%). SHAP values highlighted spatial variables, particularly the province in which the error occurred and the road type, as the most important features for predicting errors in accident hotspots in Iran. Fatigue, a behavioral error, was associated with a higher predicted likelihood of errors occurring in accident hotspots, and certain weather-related variables, including dew point and relative humidity, were also important. Temporal variables, however, did not contribute significantly to the prediction.
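The abstract reports an AUC with a 95% uncertainty interval but does not state how the interval was obtained. One plausible way, shown here as a sketch under that assumption, is a percentile bootstrap over the test records. The AUC itself is computed as the probability that a randomly chosen positive record is scored above a randomly chosen negative one. The helper names `auc` and `bootstrap_auc_interval` are hypothetical, and the pairwise loop is only practical for small samples, not for hundreds of thousands of records.

```python
import random


def auc(y_true, scores):
    """AUC as P(score of a positive > score of a negative); ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one record of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_interval(y_true, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the AUC (assumed method, see lead-in)."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]   # resample with replacement
        ys = [y_true[i] for i in idx]
        if 0 < sum(ys) < n:                          # keep resamples with both classes
            stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper
```

Calling `bootstrap_auc_interval(y_test, predicted_probabilities)` on held-out labels and model outputs would return the lower and upper bounds to report alongside the point estimate from `auc`.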
By integrating spatiotemporal, behavioral, and weather-related variables, our study highlighted the dominance of spatial factors in predicting errors in accident hotspots. These findings underscore the need for targeted road infrastructure improvements and data-driven policymaking to mitigate RTA risks.