College of Engineering, University of Georgia, Athens, GA 30602, USA.
Accid Anal Prev. 2021 Sep;159:106240. doi: 10.1016/j.aap.2021.106240. Epub 2021 Jun 16.
Crash data analysis is commonly subject to imbalanced data. Depending on facility and control types, some crash types are more frequent than others. However, uncommon crash types are typically more severe and carry higher economic and societal costs, which makes them crucial to prevent. It is therefore important to develop inferential models that can reliably predict crash types and identify contributing factors, especially for the severe types. Current practice in modeling infrequent events generally disregards the disparity in data representation, which can lead to biased models. Mitigating and managing imbalanced data is thus essential to developing meaningful and robust models that help reveal effective countermeasures. This study compares the effects of resampling techniques on the performance of both machine learning and classical statistical models for classifying and predicting crash types on freeways. Specifically, a mixed sampling approach that couples cluster-based under-sampling with three popular over-sampling methods (random over-sampling, synthetic minority over-sampling, and adaptive synthetic sampling) was investigated with respect to four crash classification models: three ensemble machine learning models (CatBoost, XGBoost, and Random Forests) and one classical statistical model (Nested Logit). All three resampling methods consistently enhanced the performance of all models. Among the three over-sampling methods, adaptive synthetic sampling performed best and substantially improved the prediction of minority crash types without impeding the prediction of the majority crash type. This is likely because the density-based approach of adaptive synthetic sampling creates synthetic instances that are more congruent with the underlying manifold structure of the high-dimensional feature space.
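To make the mixed sampling idea concrete, the following is a minimal sketch in plain Python and is not the study's implementation. It assumes that cluster-based under-sampling means replacing the majority class with k-means centroids, and it uses simple SMOTE-style interpolation for the minority class rather than the adaptive (density-weighted) variant the abstract reports as best. The function names, the class labels, the `target` size, and the toy data are all hypothetical.

```python
import random
from collections import Counter


def sqdist(p, q):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_centroids(points, k, rng, iters=20):
    """Plain Lloyd's k-means; returns k centroids that stand in for the class."""
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sqdist(p, centroids[c]))
            clusters[nearest].append(p)
        for j, members in enumerate(clusters):
            if members:  # keep the previous centroid if a cluster empties out
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def smote_like(points, n_new, rng, k=3):
    """Create n_new points by interpolating towards a random near neighbour."""
    synthetic = []
    for _ in range(n_new):
        p = rng.choice(points)
        neighbours = sorted((q for q in points if q is not p),
                            key=lambda q: sqdist(p, q))[:k]
        if not neighbours:  # single-point class: fall back to duplication
            synthetic.append(list(p))
            continue
        q = rng.choice(neighbours)
        t = rng.random()
        synthetic.append([a + t * (b - a) for a, b in zip(p, q)])
    return synthetic


def mixed_resample(X, y, target, seed=0):
    """Shrink classes above `target` to centroids; grow classes below it."""
    rng = random.Random(seed)
    by_class = {}
    for xi, yi in zip(X, y):
        by_class.setdefault(yi, []).append(list(xi))
    X_out, y_out = [], []
    for label, pts in by_class.items():
        if len(pts) > target:
            pts = kmeans_centroids(pts, target, rng)
        elif len(pts) < target:
            pts = pts + smote_like(pts, target - len(pts), rng)
        X_out.extend(pts)
        y_out.extend([label] * len(pts))
    return X_out, y_out


# Toy data: 12 majority and 3 minority observations (labels are made up).
data_rng = random.Random(1)
X = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(12)]
X += [[data_rng.gauss(4, 0.5), data_rng.gauss(4, 0.5)] for _ in range(3)]
y = ["rear_end"] * 12 + ["sideswipe"] * 3

X_res, y_res = mixed_resample(X, y, target=6)
print(Counter(y_res))
```

In practice the resampling would be applied to the training split only, so that the evaluation data keep the original class distribution. An adaptive variant would additionally generate more synthetic points around minority observations that have many majority-class neighbours.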