Interaction effect between data discretization and data resampling for class-imbalanced medical datasets.

Author information

Huang Min-Wei, Tsai Chih-Fong, Lin Wei-Chao, Lin Jia-Yang

Affiliations

Kaohsiung Municipal Kai-Syuan Psychiatric Hospital, Kaohsiung.

Department of Physical Therapy and Graduate Institute of Rehabilitation Science, China Medical University, Taichung.

Publication information

Technol Health Care. 2025 Mar;33(2):1000-1013. doi: 10.1177/09287329241295874. Epub 2024 Nov 25.

DOI:10.1177/09287329241295874

PMID:40105161

Abstract

Background

Data discretization is an important preprocessing step in data mining that converts continuous feature values into discrete ones, which allows some data mining algorithms to construct more effective models and facilitates the data mining process. Because many medical datasets are class imbalanced, data resampling methods, including oversampling, undersampling, and hybrid sampling, have been widely applied to rebalance the training set and help classifiers differentiate between the majority and minority classes.

Objective

We examine how incorporating both data discretization and data resampling into the analytical process affects classifier performance on class-imbalanced medical datasets. The experiments also compare the order in which these two steps are carried out.

Methods

Two experimental studies were conducted, one based on 11 two-class imbalanced medical datasets and the other on 3 multiclass imbalanced medical datasets. The two discretization algorithms employed are ChiMerge and the minimum description length principle (MDLP). The data resampling algorithms compared are Tomek links undersampling, synthetic minority oversampling technique (SMOTE) oversampling, and SMOTE-Tomek hybrid sampling. The support vector machine (SVM), C4.5 decision tree, and random forest (RF) classifiers were used to examine the classification performance of the different approaches.

Results

On average, the combination approaches allow the classifiers to achieve a higher area under the ROC curve (AUC) than the best baseline approach, by approximately 0.8%-3.5% for two-class and 0.9%-2.5% for multiclass imbalanced medical datasets. In particular, the best results for two-class imbalanced datasets are obtained by performing MDLP discretization first and SMOTE oversampling second, which provides the highest AUC and requires the least computational cost. For multiclass imbalanced datasets, performing SMOTE or SMOTE-Tomek resampling first and ChiMerge discretization second offers the best performance.

Conclusions

Classifiers with oversampling can perform better than the baseline without oversampling. In contrast, performing data discretization does not necessarily make the classifiers outperform the baselines. On average, the combination approaches have the potential to allow the classifiers to achieve a higher AUC than the best baseline approach.
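The two processing orders compared in the abstract (discretize then resample, or resample then discretize) can be illustrated with a minimal sketch. This is not the authors' implementation: equal-width binning stands in for MDLP and ChiMerge, `smote` is a simplified two-class version of SMOTE, and every function name and the toy dataset are hypothetical.

```python
import random

def equal_width_bins(column, n_bins=3):
    """Stand-in discretizer (NOT MDLP or ChiMerge): equal-width bin indices."""
    lo, hi = min(column), max(column)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width for constant columns
    return [min(int((v - lo) / width), n_bins - 1) for v in column]

def discretize(X, n_bins=3):
    """Discretize every feature column independently."""
    columns = [equal_width_bins(list(col), n_bins) for col in zip(*X)]
    return [list(row) for row in zip(*columns)]

def squared_distance(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))

def smote(X, y, minority, k=3, seed=0):
    """Simplified two-class SMOTE: interpolate between a minority row and one
    of its k nearest minority neighbours until both classes have equal size.
    Assumes at least two minority rows."""
    rng = random.Random(seed)
    minority_rows = [x for x, label in zip(X, y) if label == minority]
    n_needed = len(y) - 2 * len(minority_rows)  # majority count - minority count
    synthetic = []
    for _ in range(max(n_needed, 0)):
        base = rng.choice(minority_rows)
        others = [r for r in minority_rows if r is not base]
        neighbours = sorted(others, key=lambda r: squared_distance(base, r))[:k]
        other = rng.choice(neighbours)
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(base, other)])
    return X + synthetic, y + [minority] * len(synthetic)

# Hypothetical toy data: six majority rows (class 0), two minority rows (class 1).
X = [[1.0, 10.0], [1.2, 11.0], [0.8, 9.5], [1.1, 10.5],
     [0.9, 9.0], [1.3, 10.2], [3.0, 20.0], [3.2, 21.0]]
y = [0] * 6 + [1] * 2

# Order A: discretize first, then oversample. Synthetic rows are interpolated
# between bin indices, so in general they may hold non-integer values.
X_a, y_a = smote(discretize(X), y, minority=1)

# Order B: oversample first, then discretize. Bin edges are computed on the
# augmented data, and every value ends up as an integer bin index.
X_resampled, y_b = smote(X, y, minority=1)
X_b = discretize(X_resampled)

print(len(X_a), y_a.count(0), y_a.count(1))
print(len(X_b), y_b.count(0), y_b.count(1))
```

Both orders return a balanced training set, but they hand different feature values to the classifier, which is one reason the order of the two steps can matter. In practice the resampling would be applied to the training split only.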
