The Impact of Oversampling with SMOTE on the Performance of 3 Classifiers in Prediction of Type 2 Diabetes.

Authors

Ramezankhani Azra, Pournik Omid, Shahrabi Jamal, Azizi Fereidoun, Hadaegh Farzad, Khalili Davood

Affiliations

Prevention of Metabolic Disorders Research Center, Research Institute for Endocrine Sciences, Shahid Beheshti University of Medical Sciences, Tehran, Iran (AR, FH, DK)

Department of Community Medicine, School of Medicine, Iran University of Medical Sciences, Tehran, Iran (OP)

Publication information

Med Decis Making. 2016 Jan;36(1):137-44. doi: 10.1177/0272989X14560647. Epub 2014 Dec 1.

Abstract

OBJECTIVE

To evaluate the impact of the synthetic minority oversampling technique (SMOTE) on the performance of probabilistic neural network (PNN), naïve Bayes (NB), and decision tree (DT) classifiers for predicting diabetes in a prospective cohort of the Tehran Lipid and Glucose Study (TLGS).

METHODS

Data from 6647 nondiabetic participants aged 20 years or older, with more than 10 years of follow-up, were used to develop prediction models based on 21 common risk factors. The minority class in the training dataset was oversampled using SMOTE at 100%, 200%, 300%, 400%, 500%, 600%, and 700% of its original size. The original and the oversampled training datasets were used to build the classification models. Accuracy, sensitivity, specificity, precision, F-measure, and Youden's index were used to evaluate the performance of the classifiers on the test dataset. To compare the performance of the 3 classification models, we used the ROC convex hull (ROCCH).
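The oversampling and evaluation steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes the usual SMOTE convention that oversampling at N% creates N/100 synthetic points per minority sample, each placed at a random position on the segment between a sample and one of its k nearest minority neighbours. The function names `smote` and `classification_metrics`, the choice k = 5, and the confusion-matrix counts in the demo are all hypothetical.

```python
import numpy as np


def smote(X_min, percent, k=5, seed=0):
    """Generate synthetic minority samples (illustrative sketch).

    X_min   : array of minority-class feature vectors (needs >= 2 rows)
    percent : oversampling amount, assumed to be a multiple of 100
    k       : number of nearest minority neighbours to interpolate towards
    Returns only the synthetic rows; stack them with X_min to train.
    """
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    per_sample = percent // 100          # synthetic points per original sample
    k = min(k, n - 1)                    # cannot have more neighbours than peers

    # Pairwise squared Euclidean distances among minority samples.
    d = ((X_min[:, None, :] - X_min[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d, np.inf)          # a sample is not its own neighbour
    neighbours = np.argsort(d, axis=1)[:, :k]

    synthetic = np.empty((n * per_sample, X_min.shape[1]))
    row = 0
    for i in range(n):
        for _ in range(per_sample):
            j = neighbours[i, rng.integers(k)]   # pick one neighbour at random
            gap = rng.random()                   # position along the segment
            synthetic[row] = X_min[i] + gap * (X_min[j] - X_min[i])
            row += 1
    return synthetic


def classification_metrics(tp, fp, tn, fn):
    """Compute the six evaluation measures from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f_measure": 2 * precision * sensitivity / (precision + sensitivity),
        "youden": sensitivity + specificity - 1,
    }


# Demo on made-up data (not from the study).
toy_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
extra = smote(toy_minority, percent=200, k=3)
print(extra.shape)
print(classification_metrics(tp=40, fp=10, tn=90, fn=10))
```

Only the training set should be oversampled; the metrics are then computed on an untouched test set, as the methods describe.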

RESULTS

Oversampling the minority class at 700% (completely balanced) increased the sensitivity of the PNN, DT, and NB by 64%, 51%, and 5%, respectively, but decreased the accuracy and specificity of the 3 classification methods. NB had the best Youden's index before and after oversampling. The ROCCH showed that PNN is suboptimal for any class and cost conditions.
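To illustrate what it means for a classifier to be suboptimal under the ROCCH, the sketch below computes the upper convex hull of a set of operating points in ROC space, given as (false-positive rate, true-positive rate) pairs. A classifier whose point lies strictly below the hull cannot be the best choice under any class distribution or misclassification cost. The three points labelled A, B, and C are invented for illustration and are not results from the study; the helper names `cross` and `roc_convex_hull` are likewise hypothetical.

```python
def cross(o, a, b):
    """Z-component of the cross product of vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def roc_convex_hull(points):
    """Return the upper convex hull of (FPR, TPR) points.

    The trivial classifiers (0, 0) and (1, 1) are always included,
    so the hull runs from "predict nobody positive" to "predict everybody positive".
    """
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    hull = []
    for p in pts:
        # Drop the last hull point while it lies on or below the line
        # joining its predecessor to the new point p.
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    return hull


# Made-up operating points (FPR, TPR) for three hypothetical classifiers.
classifiers = {"A": (0.1, 0.6), "B": (0.3, 0.65), "C": (0.4, 0.9)}
hull = roc_convex_hull(classifiers.values())

for name, point in classifiers.items():
    status = "on the hull" if point in hull else "below the hull (suboptimal)"
    print(name, point, status)
```

In this toy example, B falls below the segment joining A and C, so some combination of A and C does at least as well as B at every cost and class ratio.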

CONCLUSIONS

When building a classifier with a machine learning algorithm such as the PNN or DT, class skew in the data should be taken into account. The NB and DT were the optimal classifiers for a prediction task in an imbalanced medical database.
