University of Michigan Transportation Research Institute, 2901 Baxter Road, Ann Arbor, MI 48109, USA; Department of Industrial and Operations Engineering, University of Michigan, 1205 Beal Avenue, Ann Arbor, MI 48109, USA.
Department of Industrial and Operations Engineering, University of Michigan, 1205 Beal Avenue, Ann Arbor, MI 48109, USA.
Accid Anal Prev. 2018 Nov;120:250-261. doi: 10.1016/j.aap.2018.08.025. Epub 2018 Aug 30.
This study aims to classify injury severity in motor-vehicle crashes with both high accuracy and high sensitivity. The dataset contains 297,113 vehicle crashes from 2016 to 2017, obtained from the Michigan Traffic Crash Facts (MTCF) dataset. As in any other crash dataset, the accident severity classes are not equally represented in MTCF. To account for the imbalanced classes, several techniques are used, including under-sampling and over-sampling. Using five classification models (logistic regression, decision tree, neural network, gradient boosting model, and naïve Bayes classifier), we classify the levels of injury severity and attempt to improve classification performance with two training-testing methods: bootstrap aggregation (bagging) and majority voting. Because of the imbalance in the dataset, we use the geometric mean (G-mean) to evaluate classification performance. We show that classification performance is highest when bagging is used with decision trees and the imbalanced data are treated by over-sampling. The effect of the imbalance treatments is greatest when under-sampling is combined with bagging. In addition to the original five classes of injury severity in the MTCF dataset, we consider two further classification problems, one with two classes and one with three classes, to (1) investigate the impact of the number of classes on the performance of the classification models, and (2) enable comparison of our results with the literature.
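The abstract does not state the exact G-mean formula used. A common definition, assumed here, is the geometric mean of the per-class recalls (sensitivities). The sketch below illustrates why such a metric can be preferable to plain accuracy on imbalanced classes. The class labels and counts are hypothetical toy values, not MTCF data, and the function names are illustrative only.

```python
from collections import Counter
from math import prod


def per_class_recall(y_true, y_pred):
    """Return {class: recall} for every class that appears in y_true."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {c: hits[c] / totals[c] for c in totals}


def g_mean(y_true, y_pred):
    """Geometric mean of per-class recalls (assumed definition of G-mean)."""
    recalls = per_class_recall(y_true, y_pred)
    return prod(recalls.values()) ** (1.0 / len(recalls))


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical toy labels: 95 "none" crashes followed by 5 "injury" crashes.
y_true = ["none"] * 95 + ["injury"] * 5

# Classifier A always predicts the majority class.
always_majority = ["none"] * 100

# Classifier B recovers 4 of the 5 injuries but mislabels 19 "none" cases.
balanced = ["none"] * 76 + ["injury"] * 19 + ["injury"] * 4 + ["none"]

if __name__ == "__main__":
    for name, y_pred in [("majority", always_majority), ("balanced", balanced)]:
        print(name, accuracy(y_true, y_pred), g_mean(y_true, y_pred))
```

Under this definition, the majority-only classifier has the higher accuracy but a G-mean of zero, because it never detects the minority class, whereas the second classifier is rewarded for its recall on both classes.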