Interaction effect between data discretization and data resampling for class-imbalanced medical datasets.

Author information

Huang Min-Wei, Tsai Chih-Fong, Lin Wei-Chao, Lin Jia-Yang

Affiliations

Kaohsiung Municipal Kai-Syuan Psychiatric Hospital, Kaohsiung.

Department of Physical Therapy and Graduate Institute of Rehabilitation Science, China Medical University, Taichung.

Publication information

Technol Health Care. 2025 Mar;33(2):1000-1013. doi: 10.1177/09287329241295874. Epub 2024 Nov 25.

Abstract

Background

Data discretization is an important preprocessing step in data mining that converts continuous feature values into discrete ones. This allows certain data mining algorithms to construct more effective models and facilitates the data mining process. Because many medical datasets are class imbalanced, data resampling methods (oversampling, undersampling, and hybrid sampling) have been widely applied to rebalance the training set so that classifiers can differentiate effectively between the majority and minority classes.

Objective

Herein, we examine how incorporating both data discretization and data resampling as steps in the analytical process affects classifier performance on class-imbalanced medical datasets. The order in which these two steps are carried out is compared in the experiments.

Methods

Two experimental studies were conducted, one based on 11 two-class imbalanced medical datasets and the other on 3 multiclass imbalanced medical datasets. The two discretization algorithms employed are ChiMerge and the minimum description length principle (MDLP). The data resampling algorithms chosen for performance comparison are Tomek links undersampling, synthetic minority oversampling technique (SMOTE) oversampling, and SMOTE-Tomek hybrid sampling. The support vector machine (SVM), C4.5 decision tree, and random forest (RF) techniques were used to examine the classification performance of the different approaches.

Results

The results show that, on average, the combination approaches allow the classifiers to provide higher area under the ROC curve (AUC) rates than the best baseline approach, by approximately 0.8%-3.5% for two-class and 0.9%-2.5% for multiclass imbalanced medical datasets. In particular, the optimal results for two-class imbalanced datasets are obtained by performing MDLP first for data discretization and SMOTE second for oversampling, which provides the highest AUC rate and requires the least computational cost. For multiclass imbalanced datasets, performing SMOTE or SMOTE-Tomek first for data resampling and ChiMerge second for data discretization offers the best performance.

Conclusions

Classifiers with oversampling can provide better performance than the baseline method without oversampling. In contrast, performing data discretization does not necessarily make the classifiers outperform the baselines. On average, the combination approaches have the potential to allow the classifiers to provide higher AUC rates than the best baseline approach.
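The abstract gives no implementation details, so the following is only a minimal sketch of what the two step orderings compared in the study might look like in code. It uses the Python standard library alone and rests on several assumptions that are not from the paper: equal-width binning is swapped in as a simple stand-in for MDLP and ChiMerge, `smote` is a simplified version of the SMOTE interpolation idea, all function names are hypothetical, and rounding interpolated bin indices back to integers in the discretize-first ordering is one possible choice rather than the authors' method.

```python
import math
import random

def smote(minority, n_new, k=3, seed=0):
    """Simplified SMOTE: each synthetic point lies on the segment between a
    minority sample and one of its k nearest minority neighbours.
    Requires at least two minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base`, excluding `base` itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic

def fit_equal_width(rows, n_bins=4):
    """Learn per-feature cut points (assumed stand-in for MDLP/ChiMerge)."""
    cuts = []
    for col in zip(*rows):
        lo, hi = min(col), max(col)
        step = (hi - lo) / n_bins
        cuts.append([lo + step * i for i in range(1, n_bins)])
    return cuts

def discretize(rows, cuts):
    """Replace each continuous value with the index of its bin."""
    return [[sum(v > c for c in col_cuts) for v, col_cuts in zip(row, cuts)]
            for row in rows]

def discretize_then_smote(majority, minority, n_bins=4, k=3, seed=0):
    """Ordering 1: discretize first, then oversample the minority class."""
    cuts = fit_equal_width(majority + minority, n_bins)
    maj_d = discretize(majority, cuts)
    min_d = discretize(minority, cuts)
    new = smote(min_d, len(maj_d) - len(min_d), k, seed)
    # Assumption: interpolated bin indices are rounded back to whole bins.
    new = [[round(v) for v in row] for row in new]
    return maj_d, min_d + new

def smote_then_discretize(majority, minority, n_bins=4, k=3, seed=0):
    """Ordering 2: oversample first, then discretize the balanced data."""
    new = smote(minority, len(majority) - len(minority), k, seed)
    balanced_min = minority + new
    cuts = fit_equal_width(majority + balanced_min, n_bins)
    return discretize(majority, cuts), discretize(balanced_min, cuts)

# Toy two-class data (made up for illustration): 6 majority, 3 minority rows.
majority = [[1.0, 10.0], [2.0, 12.0], [3.0, 11.0],
            [4.0, 13.0], [5.0, 14.0], [6.0, 15.0]]
minority = [[7.0, 30.0], [8.0, 32.0], [9.0, 31.0]]

for pipeline in (discretize_then_smote, smote_then_discretize):
    maj_out, min_out = pipeline(majority, minority)
    print(pipeline.__name__, len(maj_out), len(min_out))
```

Both orderings return a balanced training set of bin indices. The practical difference in this sketch is where the cut points come from: in the first ordering they are learned from the original data only, and synthetic samples are generated in the discretized space; in the second, the synthetic samples are generated in the continuous space and also influence the cut points.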

