Assessing and mitigating the effects of class imbalance in machine learning with application to X-ray imaging.

Affiliations

Department of Medical Imaging, University of Toronto, Toronto, ON, M5T 1W7, Canada.

Department of Mathematics, Statistics and Computer Science, St Francis Xavier University, Antigonish, NS, Canada.

Publication information

Int J Comput Assist Radiol Surg. 2020 Dec;15(12):2041-2048. doi: 10.1007/s11548-020-02260-6. Epub 2020 Sep 23.

DOI:10.1007/s11548-020-02260-6

PMID:32965624

Abstract

PURPOSE

Machine learning (ML) algorithms are well known to exhibit variations in prediction accuracy when provided with imbalanced training sets typically seen in medical imaging (MI) due to the imbalanced ratio of pathological and normal cases. This paper presents a thorough investigation of the effects of class imbalance and methods for mitigating class imbalance in ML algorithms applied to MI.

METHODS

We first selected five classes from the Image Retrieval in Medical Applications (IRMA) dataset and performed multiclass classification using the random forest model (RFM); we then performed binary classification using a convolutional neural network (CNN) on a chest X-ray dataset. An imbalanced class was created in the training set by varying the number of images in that class. Methods tested to mitigate class imbalance included oversampling, undersampling, and changing the class weights of the RFM. Model performance was assessed by overall classification accuracy, overall F1 score, and the specificity, recall, and precision of the imbalanced class.
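The assessment metrics named above can be illustrated with a minimal sketch. It treats the imbalanced class as the positive class in a one-vs-rest count. The function names and the macro-averaging of F1 are assumptions made for illustration; the abstract does not state how the overall F1 score was averaged, and this is not the authors' code.

```python
def one_vs_rest_metrics(y_true, y_pred, target):
    """Recall, precision, and specificity with `target` as the positive class."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if t == target and p == target:
            tp += 1          # target image correctly labelled as target
        elif t != target and p == target:
            fp += 1          # other image wrongly labelled as target
        elif t == target and p != target:
            fn += 1          # target image missed
        else:
            tn += 1          # other image correctly not labelled as target
    # Guard against empty denominators (e.g. the class is never predicted).
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return {"recall": recall, "precision": precision, "specificity": specificity}


def accuracy(y_true, y_pred):
    """Fraction of all samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (assumed averaging scheme)."""
    scores = []
    for cls in sorted(set(y_true)):
        m = one_vs_rest_metrics(y_true, y_pred, cls)
        p, r = m["precision"], m["recall"]
        scores.append(2 * p * r / (p + r) if p + r else 0.0)
    return sum(scores) / len(scores)
```

Reporting specificity alongside recall matters here because an overrepresented class can reach high recall simply by absorbing predictions from other classes, which shows up as lower specificity and precision.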

RESULTS

A close-to-balanced training set resulted in the best model performance, and a large imbalance with overrepresentation was more detrimental to model performance than underrepresentation. Oversampling and undersampling methods were both effective in mitigating class imbalance, and efficacy of oversampling techniques was class specific.
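The three mitigation strategies named in the Methods can be sketched as follows. This is a minimal illustration only: random duplication and random removal stand in for whichever oversampling and undersampling techniques the authors used (the abstract does not name them), and all function names are hypothetical. The weight formula is the common "balanced" heuristic, n_samples / (n_classes * class_count).

```python
import random
from collections import Counter, defaultdict


def _group_by_class(samples, labels):
    """Collect the samples belonging to each class label."""
    groups = defaultdict(list)
    for s, y in zip(samples, labels):
        groups[y].append(s)
    return groups


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen items until every class matches the largest."""
    rng = random.Random(seed)
    groups = _group_by_class(samples, labels)
    target = max(len(items) for items in groups.values())
    pairs = []
    for y, items in groups.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        pairs.extend((s, y) for s in items + extra)
    rng.shuffle(pairs)
    return [s for s, _ in pairs], [y for _, y in pairs]


def random_undersample(samples, labels, seed=0):
    """Randomly drop items until every class matches the smallest."""
    rng = random.Random(seed)
    groups = _group_by_class(samples, labels)
    target = min(len(items) for items in groups.values())
    pairs = []
    for y, items in groups.items():
        pairs.extend((s, y) for s in rng.sample(items, target))
    rng.shuffle(pairs)
    return [s for s, _ in pairs], [y for _, y in pairs]


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency in the training labels."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {y: n_samples / (n_classes * c) for y, c in counts.items()}
```

Resampling would normally be applied to the training set only, so that the test set keeps its original class distribution. The weights could be passed to a classifier that accepts per-class weights, such as a random forest implementation with a class-weight option.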

CONCLUSION

This study systematically demonstrates the effect of class imbalance on two public X-ray datasets on RFM and CNN, making these findings widely applicable as a reference. Furthermore, the methods employed here can guide researchers in assessing and addressing the effects of class imbalance, while considering the data-specific characteristics to optimize imbalance mitigating methods.
