The Impact of the SMOTE Method on Machine Learning and Ensemble Learning Performance Results in Addressing Class Imbalance in Data Used for Predicting Total Testosterone Deficiency in Type 2 Diabetes Patients.

Author information

Kivrak Mehmet, Avci Ugur, Uzun Hakki, Ardic Cuneyt

Affiliations

Faculty of Medicine, Biostatistics and Medical Informatics, Recep Tayyip Erdogan University, Rize 53100, Türkiye.

Faculty of Medicine, Endocrinology and Metabolism, Recep Tayyip Erdogan University, Rize 53100, Türkiye.

Publication information

Diagnostics (Basel). 2024 Nov 22;14(23):2634. doi: 10.3390/diagnostics14232634.

Abstract

BACKGROUND AND OBJECTIVE

Diabetes mellitus is a long-term, multifaceted metabolic condition that requires ongoing medical management. Hypogonadism is a syndrome characterized by clinical and/or biochemical evidence of testosterone deficiency. Cross-sectional studies have reported that 20-80.4% of men with Type 2 diabetes have hypogonadism, and that Type 2 diabetes is associated with low testosterone. This study analyzes the use of machine learning (ML) and ensemble learning (EL) classifiers for predicting testosterone deficiency. We compared optimized traditional ML classifiers with three EL classifiers, using grid search and stratified k-fold cross-validation. We used the Synthetic Minority Over-sampling Technique (SMOTE) to address the class imbalance problem.
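
To make the stratified k-fold idea concrete, the following is a minimal sketch in plain Python, not the authors' implementation. The function name `stratified_kfold`, the round-robin assignment, and the default of five folds are assumptions chosen for illustration; the abstract does not state the number of folds or the software used.

```python
import random
from collections import defaultdict


def stratified_kfold(labels, n_splits=5, seed=0):
    """Yield (train_idx, test_idx) pairs whose class ratios mirror the full data.

    Hypothetical helper for illustration; not taken from the study.
    """
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Deal each class's shuffled indices round-robin across the folds,
    # so every fold receives roughly the same share of each class.
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % n_splits].append(idx)

    # Each fold serves once as the test set; the rest form the training set.
    for f in range(n_splits):
        test_idx = sorted(folds[f])
        train_idx = sorted(i for g in range(n_splits) if g != f for i in folds[g])
        yield train_idx, test_idx
```

In a grid search, each candidate hyperparameter setting would typically be scored on every fold and the setting with the best average score kept. A common precaution, assumed here rather than stated in the abstract, is to apply any oversampling to the training indices only, so that no synthetic samples reach the test fold.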

METHODS

The database contains 3397 patients assessed for testosterone deficiency, of whom 1886 patients with Type 2 diabetes were included in the study. In the data preprocessing stage, outlier (extreme observation) analyses were first performed with the local outlier factor (LOF), and missing value analyses were performed with random forest. SMOTE is a method for generating synthetic samples of the minority class. Four base classifiers, namely multilayer perceptron (MLP), random forest (RF), extreme learning machine (ELM), and logistic regression (LR), were used as first-level classifiers. Tree ensemble classifiers, namely AdaBoost (ADA), XGBoost, and stochastic gradient boosting (SGB), were used as second-level classifiers.
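
The core SMOTE step, interpolating between a minority sample and one of its nearest minority neighbours, can be sketched as follows. This is a simplified illustration in plain Python under stated assumptions, not the authors' pipeline: the function name `smote`, the choice of k = 5, Euclidean distance, and purely numeric features are all assumptions.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by SMOTE-style interpolation.

    `minority` is a list of numeric feature vectors from the minority class.
    Hypothetical helper for illustration; not taken from the study.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)

    synthetic = []
    for _ in range(n_new):
        # Pick a random minority sample as the base point.
        i = rng.randrange(len(minority))
        base = minority[i]

        # Rank the other minority samples by Euclidean distance to the base.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))

        # Choose one of the k nearest neighbours at random.
        neighbour = rng.choice(others[:k])

        # Place the new sample at a random point on the segment
        # between the base point and the chosen neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic
```

Because every synthetic point is a convex combination of two real minority samples, the new samples stay inside the region already occupied by the minority class rather than simply duplicating existing rows.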

RESULTS

After SMOTE, diagnostic accuracy decreased in all base classifiers except ELM, while sensitivity increased in all classifiers. Similarly, specificity decreased in all classifiers, while the F1 score increased. Among the base classifiers, RF gave the more successful results on the training dataset. The most successful ensemble classifier on the training dataset was ADA, for both the original and the SMOTE data. On the testing data, XGBoost was the most suitable model. XGBoost, which showed balanced performance especially when SMOTE was used, can be preferred when correcting class imbalance.
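
The pattern reported above, lower accuracy and specificity alongside higher sensitivity and F1, follows from how these metrics are defined on a confusion matrix. The sketch below computes them in plain Python. The counts in the example are hypothetical numbers invented purely to illustrate the arithmetic; they are not results from the study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return accuracy, sensitivity, specificity and F1 from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # recall on the positive class
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
    }


# Hypothetical counts (NOT from the study): 100 positives, 900 negatives.
# "before" mimics a model that mostly predicts the majority class;
# "after" mimics a model trained on rebalanced data.
before = classification_metrics(tp=20, fp=20, tn=880, fn=80)
after = classification_metrics(tp=70, fp=120, tn=780, fn=30)

for name in ("accuracy", "sensitivity", "specificity", "f1"):
    print(f"{name}: {before[name]:.3f} -> {after[name]:.3f}")
```

With these invented counts, the rebalanced model finds many more true positives at the cost of more false positives. Since negatives dominate the data, the extra false positives pull accuracy and specificity down even as sensitivity and F1 rise.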

CONCLUSIONS

SMOTE is used to correct class imbalance in the original data. In this study, applying SMOTE decreased diagnostic accuracy in some models but increased sensitivity significantly. This indicates that SMOTE has a positive effect on predicting the minority class.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0668/11640355/7117f02cedc2/diagnostics-14-02634-g001.jpg
