Clustering and classification for dry bean feature imbalanced data.

Authors

Lee Chou-Yuan, Wang Wei, Huang Jian-Qiong

Affiliations

School of Big Data, Fuzhou University of International Studies and Trade, Fuzhou, 350202, China.

School of Software, Yunnan University, Kunming, 650000, China.

Publication information

Sci Rep. 2024 Dec 28;14(1):31058. doi: 10.1038/s41598-024-82253-6.

DOI:10.1038/s41598-024-82253-6
PMID:39730714
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11681048/
Abstract

Traditional machine learning methods such as decision tree (DT), random forest (RF), and support vector machine (SVM) show low classification performance on imbalanced data. This paper proposes an algorithm for the dry bean dataset and the obesity levels dataset that balances the minority and majority classes and adds a clustering step, in order to improve the classification accuracy of traditional machine learning and other performance indicators such as precision, recall, F1-score, and area under the curve (AUC) for imbalanced data. The key idea combines two techniques. The borderline synthetic minority oversampling technique (BLSMOTE) generates new samples from the samples on the boundary of the minority class, which reduces the impact of noise on model building. K-means clustering divides the data into groups according to similarities or common features. The results show that the proposed algorithm, BLSMOTE + K-means + SVM, is superior to the other traditional machine learning methods in classification and in the performance indicators listed above. BLSMOTE + K-means + DT generates decision rules for the dry bean dataset and the obesity levels dataset, and BLSMOTE + K-means + RF ranks the importance of the explanatory variables. These experimental results can provide scientific evidence for decision-makers.
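
The oversampling step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no parameters or code, so the version below assumes the commonly described "borderline-1" rule (a minority sample is borderline when at least half, but not all, of its `m` nearest neighbours belong to the majority class) and uses only the Python standard library. The function names `k_nearest`, `find_borderline`, and `borderline_smote`, the parameters `m`, `k`, and `seed`, and the toy data are all hypothetical.

```python
import math
import random


def k_nearest(point, pool, k):
    """Return the k samples in pool closest to point (Euclidean), excluding point itself."""
    others = [q for q in pool if q is not point]
    return sorted(others, key=lambda q: math.dist(point[0], q[0]))[:k]


def find_borderline(samples, minority_label, m=5):
    """Select minority samples whose m nearest neighbours are mostly, but not all, majority."""
    borderline = []
    for s in samples:
        if s[1] != minority_label:
            continue
        n_major = sum(1 for q in k_nearest(s, samples, m) if q[1] != minority_label)
        # n_major == m is treated as noise; n_major < m / 2 is treated as safe.
        if m / 2 <= n_major < m:
            borderline.append(s)
    return borderline


def borderline_smote(samples, minority_label, n_new, m=5, k=3, seed=0):
    """Create n_new synthetic minority points by interpolating from borderline samples."""
    rng = random.Random(seed)
    minority = [s for s in samples if s[1] == minority_label]
    borderline = find_borderline(samples, minority_label, m)
    if not borderline or len(minority) < 2:
        return []
    synthetic = []
    for _ in range(n_new):
        p = rng.choice(borderline)                 # a borderline minority sample
        q = rng.choice(k_nearest(p, minority, k))  # one of its nearest minority neighbours
        gap = rng.random()                         # interpolation factor in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(p[0], q[0])))
    return synthetic


# Toy data (assumed for illustration): a 5x5 majority grid and eight minority points.
majority = [((float(x), float(y)), "major") for x in range(5) for y in range(5)]
minority = [
    ((1.5, 1.5), "minor"),  # surrounded by majority points only: treated as noise
    ((3.5, 3.5), "minor"),  # at the class boundary: borderline
    ((4.5, 4.5), "minor"),
    ((5.0, 5.0), "minor"),
    ((5.0, 5.5), "minor"),
    ((5.5, 5.0), "minor"),
    ((5.5, 5.5), "minor"),
    ((6.0, 6.0), "minor"),
]
samples = majority + minority

borderline = find_borderline(samples, "minor")
synthetic = borderline_smote(samples, "minor", n_new=10)
print("borderline samples:", [s[0] for s in borderline])
print("synthetic samples generated:", len(synthetic))
```

In the pipeline named in the abstract, the balanced data would then go through K-means clustering and a classifier such as SVM. How the cluster assignments feed into the classifier is not specified in the abstract, so that step is left out of the sketch.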
