Department of Health Administration, Dankook University, Cheonan, 31116, South Korea.
Department of Healthcare Management, Eulji University, Seongnam, 13135, South Korea.
BMC Public Health. 2022 Aug 2;22(1):1476. doi: 10.1186/s12889-022-13719-3.
Injuries caused by road traffic accidents (RTAs) are classified under the International Classification of Diseases-10 as 'S00-T99'. With a mortality rate of only 1.2% among all RTA victims, they form an imbalanced sample. To predict mortality from the characteristics of the external causes of RTA injuries, we compared performance across different correction (resampling) and classification techniques for imbalanced samples.
The present study used data spanning a 5-year period (2013-2017) from the Korean National Hospital Discharge In-depth Injury Survey (KNHDS), a national-level survey conducted by the Korea Disease Control and Prevention Agency. A total of eight variables were used in the prediction, covering patient, accident, and injury/disease characteristics. Because the data were imbalanced, a sample consisting only of severe injuries was constructed and compared against the total sample. The samples were preprocessed according to their characteristics: they were standardized first, because they contained many variables with different units. Among the ensemble techniques for classification, the study used Random Forest, Extra-Trees, and XGBoost. Four over- and under-sampling techniques were applied, and algorithm performance was compared in terms of accuracy, precision, recall, F1, and the Matthews correlation coefficient (MCC).
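To illustrate why accuracy alone is a weak yardstick at a 1.2% mortality rate, the five metrics named above can be computed from confusion-matrix counts. The sketch below is illustrative only: the function name `binary_metrics` and both sets of counts are hypothetical and are not taken from the study's data or results.

```python
import math


def binary_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall, F1 and MCC from confusion-matrix counts.

    Ratios with a zero denominator are reported as 0.0 (a common convention).
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "mcc": mcc}


# Hypothetical cohort of 1,000 victims with 12 deaths (1.2%).
# A model that predicts "survives" for everyone finds no deaths at all.
always_survive = binary_metrics(tp=0, fp=0, fn=12, tn=988)

# A hypothetical model that flags 49 victims and catches 9 of the 12 deaths.
flagging = binary_metrics(tp=9, fp=40, fn=3, tn=948)

for name, result in (("always_survive", always_survive), ("flagging", flagging)):
    print(name, {key: round(value, 3) for key, value in result.items()})
```

In this toy case the trivial model has the higher accuracy even though its recall and MCC are zero, which is why recall, F1, and MCC are reported alongside accuracy for imbalanced samples.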
Among the prediction techniques, XGBoost performed best. While the synthetic minority oversampling technique (SMOTE), a type of over-sampling, also demonstrated a certain level of performance, under-sampling was the most effective. Overall, prediction by the XGBoost model with samples using SMOTE produced the best results.
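The two families of correction technique compared here can be sketched in a few lines. This is a minimal standard-library illustration of random under-sampling and of SMOTE-style interpolation between minority-class neighbours. It is not the study's code and not the implementation found in any particular library; the function names, the two-feature toy data, and the choice of k = 5 are assumptions made for the example.

```python
import math
import random


def random_undersample(majority, minority, rng):
    """Keep every minority row and an equally sized random subset of majority rows."""
    return rng.sample(majority, len(minority)), list(minority)


def smote_like(minority, n_new, k, rng):
    """Create n_new synthetic minority rows.

    Each synthetic row lies on the line segment between a randomly chosen
    minority row and one of its k nearest minority-class neighbours.
    """
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (row for row in minority if row is not base),
            key=lambda row: math.dist(base, row),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


# Hypothetical toy data: 988 survivors and 12 deaths, two standardized features.
rng = random.Random(0)
majority = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(988)]
minority = [(rng.gauss(2, 1), rng.gauss(2, 1)) for _ in range(12)]

# Under-sampling shrinks the majority class down to the minority size.
kept_majority, kept_minority = random_undersample(majority, minority, rng)

# Over-sampling grows the minority class up to the majority size.
synthetic = smote_like(minority, n_new=len(majority) - len(minority), k=5, rng=rng)
balanced_minority = minority + synthetic

print(len(kept_majority), len(kept_minority))   # class sizes after under-sampling
print(len(majority), len(balanced_minority))    # class sizes after over-sampling
```

Under-sampling discards most of the majority rows, whereas the SMOTE-style step keeps all of the original data and adds interpolated minority rows. In practice, resampling is applied to the training split only so that the evaluation data keep the original class ratio.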
This study presented an empirical comparison of sampling techniques and classification algorithms, used in combination, in terms of their effect on prediction performance for imbalanced samples. The findings could be used as reference data in classification analyses of imbalanced data in the medical field.