Feature-based ensemble modeling for addressing diabetes data imbalance using the SMOTE, RUS, and random forest methods: a prediction study.

Author information

Jang Younseo

Affiliation

College of Medicine, Ewha Womans University, Seoul, Korea.

Publication information

Ewha Med J. 2025 Apr;48(2):e32. doi: 10.12771/emj.2025.00353. Epub 2025 Apr 15.

Abstract

PURPOSE

This study developed and evaluated a feature-based ensemble model integrating the synthetic minority oversampling technique (SMOTE) and random undersampling (RUS) methods with a random forest approach to address class imbalance in machine learning for early diabetes detection, aiming to improve predictive performance.

METHODS

Using the scikit-learn diabetes dataset (442 samples, 10 features), we binarized the target variable (diabetes progression) at the 75th percentile and split the data 80:20 using stratified sampling. The training set was balanced to a 1:2 minority-to-majority ratio via SMOTE (0.6) and RUS (0.66). A feature-based ensemble model was constructed by training random forest classifiers on 10 two-feature subsets, selected based on feature importance, and combining their outputs using soft voting. Performance was compared against 13 baseline models on the imbalanced test set, using accuracy and area under the curve (AUC) as metrics.

RESULTS

The feature-based ensemble model and balanced random forest both achieved the highest accuracy (0.8764), followed by the fully connected neural network (0.8700). The ensemble model had an excellent AUC (0.9227), while k-nearest neighbors had the lowest accuracy (0.8427). Visualizations confirmed the ensemble model's superior discriminative ability, especially for the minority (high-risk) class, which is a critical factor in medical contexts.

CONCLUSION

Integrating SMOTE, RUS, and feature-based ensemble learning improved classification performance in imbalanced diabetes datasets by delivering robust accuracy and high recall for the minority class. This approach outperforms traditional resampling techniques and deep learning models, offering a scalable and interpretable solution for early diabetes prediction and potentially other medical applications.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c19e/12277495/2c0a151f6d05/emj-2025-00353f1.jpg
