A Synthetic Minority Oversampling Technique Based on Gaussian Mixture Model Filtering for Imbalanced Data Classification.

Authors

Xu Zhaozhao, Shen Derong, Kou Yue, Nie Tiezheng

Publication information

IEEE Trans Neural Netw Learn Syst. 2024 Mar;35(3):3740-3753. doi: 10.1109/TNNLS.2022.3197156. Epub 2024 Feb 29.

DOI: 10.1109/TNNLS.2022.3197156
PMID: 35984792

Abstract

Data imbalance is a common phenomenon in machine learning. In imbalanced data classification, minority samples are far fewer than majority samples, which makes it difficult for classifiers to learn the minority class effectively. The synthetic minority oversampling technique (SMOTE) improves the sensitivity of classifiers to the minority class by synthesizing minority samples without repetition. However, the process of synthesizing new samples in SMOTE may lead to problems such as "noisy samples" and "boundary samples." To address these problems, we propose a synthetic minority oversampling technique based on Gaussian mixture model filtering (GMF-SMOTE). GMF-SMOTE uses the expectation-maximization (EM) algorithm based on the Gaussian mixture model to group the imbalanced data. Then, an expectation-maximization filtering algorithm is used to filter out the "noisy samples" and "boundary samples" in the subclasses after grouping. Finally, to synthesize majority and minority samples, we design two dynamic oversampling ratios. Experimental results show that GMF-SMOTE performs better than traditional oversampling algorithms on 20 UCI datasets. The population averages of the sensitivity and specificity of random forest (RF) on the UCI datasets synthesized by GMF-SMOTE are 97.49% and 97.02%, respectively. In addition, we record the G-mean and Matthews correlation coefficient (MCC) of the RF, which are 97.32% and 94.80%, respectively, significantly better than those of the traditional oversampling algorithms. More importantly, two statistical tests show that GMF-SMOTE is significantly better than the traditional oversampling algorithms.
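
For readers unfamiliar with the baseline, the interpolation step of classic SMOTE that GMF-SMOTE builds on can be sketched as follows. This is a minimal illustration only, not the authors' implementation: the Gaussian mixture grouping, the filtering step, and the two dynamic oversampling ratios described in the abstract are not reproduced, and the function name `smote`, its parameters, and the toy data are assumptions made for this example.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Classic SMOTE interpolation (illustrative sketch, not GMF-SMOTE).

    Each synthetic point lies on the line segment between a randomly
    chosen minority sample and one of its k nearest minority neighbours.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the k nearest minority neighbours of `base`, excluding itself.
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: math.dist(base, minority[j]),
        )[:k]
        other = minority[rng.choice(neighbours)]
        lam = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([b + lam * (o - b) for b, o in zip(base, other)])
    return synthetic


# Toy minority class (hypothetical data, two features per sample).
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote(minority, n_new=6, k=2)
print(len(new_points))
```

Because every synthetic point is an interpolation between two existing minority samples, a noisy or borderline sample can propagate into the synthetic data, which is the problem the filtering step described in the abstract is meant to address.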
