Lee Chou-Yuan, Wang Wei, Huang Jian-Qiong
School of Big Data, Fuzhou University of International Studies and Trade, Fuzhou, 350202, China.
School of Software, Yunnan University, Kunming, 650000, China.
Sci Rep. 2024 Dec 28;14(1):31058. doi: 10.1038/s41598-024-82253-6.
Traditional machine learning methods such as decision tree (DT), random forest (RF), and support vector machine (SVM) show low classification performance on imbalanced data. This paper proposes an algorithm, evaluated on the dry bean dataset and the obesity levels dataset, that balances the minority and majority classes and adds a clustering step, with the aim of improving the classification accuracy of these methods along with precision, recall, F1-score, and area under the curve (AUC). The key idea is to combine two techniques: the borderline-synthetic minority oversampling technique (BLSMOTE), which generates new samples from minority class samples that lie on the class boundary and thereby reduces the impact of noise on model building, and K-means clustering, which divides data into groups according to similarities or common features. The results show that the proposed algorithm BLSMOTE + K-means + SVM is superior to other traditional machine learning methods in classification accuracy and in the other performance indicators. BLSMOTE + K-means + DT generates decision rules for the dry bean dataset and the obesity levels dataset, and BLSMOTE + K-means + RF ranks the importance of explanatory variables. These experimental results can provide scientific evidence for decision-makers.
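The oversampling step described in the abstract can be illustrated with a minimal sketch. It assumes the Borderline-SMOTE-1 variant, in which a minority sample counts as borderline when at least half, but not all, of its m nearest neighbours belong to the majority class. The abstract does not say which variant or settings the authors used, so the function name `borderline_smote`, the parameters `m`, `k`, and `n_new`, and the toy data are all hypothetical. The K-means and classifier stages are not shown.

```python
import math
import random


def borderline_smote(X, y, minority, m=5, k=3, n_new=1, seed=0):
    """Sketch of Borderline-SMOTE-1 (assumed variant, not the paper's code).

    X: list of feature tuples, y: list of labels, minority: minority label.
    Returns a list of synthetic minority points.
    """
    rng = random.Random(seed)
    minority_idx = [i for i, label in enumerate(y) if label == minority]
    synthetic = []
    for i in minority_idx:
        p = X[i]
        # All other samples, ordered by Euclidean distance to p.
        others = sorted(
            (j for j in range(len(X)) if j != i),
            key=lambda j: math.dist(p, X[j]),
        )
        n_major = sum(1 for j in others[:m] if y[j] != minority)
        # Borderline ("danger") sample: at least half of the m neighbours
        # are majority, but not all of them (all-majority is treated as noise).
        if not (m / 2 <= n_major < m):
            continue
        # k nearest minority neighbours of p.
        mates = [j for j in others if y[j] == minority][:k]
        if not mates:
            continue
        for _ in range(n_new):
            q = X[rng.choice(mates)]
            gap = rng.random()
            # Interpolate between p and one of its minority neighbours.
            synthetic.append(tuple(a + gap * (b - a) for a, b in zip(p, q)))
    return synthetic


# Toy data (hypothetical): label 0 is the majority class, label 1 the minority.
X = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 2), (2, 3), (3, 2),
     (2.5, 2.5), (3.5, 3.5), (5, 5), (5, 6), (6, 5), (6, 6)]
y = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

synthetic = borderline_smote(X, y, minority=1, m=5, k=3, n_new=2)
for point in synthetic:
    print(point)
```

On this toy data only the two minority points closest to the majority cluster, (2.5, 2.5) and (3.5, 3.5), qualify as borderline, so four synthetic points are produced and the minority samples deep inside their own cluster are left alone. In practice a library implementation such as `BorderlineSMOTE` from imbalanced-learn would normally be used instead of hand-written code.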