Class imbalance on medical image classification: towards better evaluation practices for discrimination and calibration performance.

Affiliations

Hospital Italiano de Buenos Aires, Buenos Aires, Argentina.

Universidad Tecnológica Nacional, Buenos Aires, Argentina.

Publication information

Eur Radiol. 2024 Dec;34(12):7895-7903. doi: 10.1007/s00330-024-10834-0. Epub 2024 Jun 11.


DOI: 10.1007/s00330-024-10834-0
PMID: 38861161

Abstract

PURPOSE: This work assesses the standard practices that the research community uses to evaluate medical imaging classifiers, with a specific focus on the implications of class imbalance. The analysis uses chest X-rays as a case study and adopts a comprehensive definition of model performance that covers both discriminative capability and model calibration.

MATERIALS AND METHODS: We conduct a concise literature review of the prevailing scientific practices for evaluating X-ray classifiers. We then perform a systematic experiment on two major chest X-ray datasets, as a didactic example of how several performance metrics behave under different class ratios and of how widely adopted metrics can conceal performance in the minority class.

RESULTS: Our literature study confirms that (1) even when dealing with highly imbalanced datasets, the community tends to use metrics that are dominated by the majority class; and (2) calibration studies are still uncommon for chest X-ray classifiers, despite their importance in healthcare. Moreover, our systematic experiments confirm that current evaluation practices may not reflect model performance in real clinical scenarios, and they suggest complementary metrics that better reflect system performance in such scenarios.

CONCLUSION: Our analysis underscores the need for enhanced evaluation practices, particularly for class-imbalanced chest X-ray classifiers. We recommend including complementary metrics such as the area under the precision-recall curve (AUC-PR), adjusted AUC-PR, and balanced Brier score. Together these reflect both discrimination and calibration performance and offer a more accurate depiction of system performance in real clinical scenarios.

CLINICAL RELEVANCE STATEMENT: This study underscores the critical need for refined evaluation metrics in medical imaging classifiers, emphasizing that prevalent metrics may mask poor performance in minority classes, potentially impacting clinical diagnoses and healthcare outcomes.

KEY POINTS: Common scientific practices in papers dealing with X-ray computer-assisted diagnosis (CAD) systems may be misleading. We highlight limitations in the reporting of evaluation metrics for X-ray CAD systems in highly imbalanced scenarios. We propose adopting alternative metrics based on experimental evaluation on large-scale datasets.
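
The effect described in the abstract can be illustrated with a small, self-contained sketch. It is not the authors' code and uses toy scores rather than chest X-ray data. Two definitions are assumptions made here for illustration: AUC-PR is approximated by step-wise average precision, and the balanced Brier score is taken as the mean of the Brier scores computed separately on the positive and negative classes. The abstract does not define adjusted AUC-PR, so it is left out. All function names are hypothetical.

```python
# Illustrative sketch only: the metric definitions are assumptions,
# not taken from the paper.

def roc_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true, scores):
    """Step-wise area under the precision-recall curve.

    Ties between a positive and a negative score are broken by input order.
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    n_pos = sum(y_true)
    true_pos, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            true_pos += 1
            total += true_pos / rank  # precision at this recall step
    return total / n_pos


def brier(y_true, probs):
    """Mean squared error between predicted probability and label."""
    return sum((p - y) ** 2 for y, p in zip(y_true, probs)) / len(y_true)


def balanced_brier(y_true, probs):
    """Assumed definition: average of the per-class Brier scores."""
    pos = [(p - 1) ** 2 for y, p in zip(y_true, probs) if y == 1]
    neg = [p ** 2 for y, p in zip(y_true, probs) if y == 0]
    return 0.5 * (sum(pos) / len(pos) + sum(neg) / len(neg))


if __name__ == "__main__":
    # Same scoring behavior, increasingly rare positives: the negatives are
    # replicated k times, which changes the class ratio but not the ranking.
    pos_scores, neg_scores = [0.9, 0.7], [0.8, 0.1]
    for k in (1, 10, 100):
        scores = pos_scores + neg_scores * k
        labels = [1] * len(pos_scores) + [0] * (len(neg_scores) * k)
        print(f"negatives x{k}: ROC AUC={roc_auc(labels, scores):.3f}, "
              f"AP={average_precision(labels, scores):.3f}")

    # A model that always predicts 0.1 on a 1%-prevalence dataset.
    labels = [1] + [0] * 99
    probs = [0.1] * 100
    print(f"Brier={brier(labels, probs):.3f}, "
          f"balanced Brier={balanced_brier(labels, probs):.3f}")
```

In this toy setting the ROC AUC is unchanged when negatives are replicated, while average precision falls as positives become rarer. Likewise, a constant low prediction earns a small Brier score on the imbalanced set, but its per-class average exposes the error on the single positive case.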

Similar articles

[1]
Class imbalance on medical image classification: towards better evaluation practices for discrimination and calibration performance.

Eur Radiol. 2024-12

[2]
Class imbalance should not throw you off balance: Choosing the right classifiers and performance metrics for brain decoding with imbalanced data.

Neuroimage. 2023-8-15

[3]
Assessing and mitigating the effects of class imbalance in machine learning with application to X-ray imaging.

Int J Comput Assist Radiol Surg. 2020-12

[4]
COVID-19 diagnosis: A comprehensive review of pre-trained deep learning models based on feature extraction algorithm.

Results Eng. 2023-6

[5]
Deep learning model calibration for improving performance in class-imbalanced medical image classification tasks.

PLoS One. 2022

[6]
COVID-19 identification in chest X-ray images on flat and hierarchical classification scenarios.

Comput Methods Programs Biomed. 2020-5-8

[7]
Conversion of adverse data corpus to shrewd output using sampling metrics.

Vis Comput Ind Biomed Art. 2020-8-11

[8]
Machine learning algorithms for outcome prediction in (chemo)radiotherapy: An empirical comparison of classifiers.

Med Phys. 2018-6-13

[9]
BarlowTwins-CXR: enhancing chest X-ray abnormality localization in heterogeneous data with cross-domain self-supervised learning.

BMC Med Inform Decis Mak. 2024-5-16

[10]
FLANNEL (Focal Loss bAsed Neural Network EnsembLe) for COVID-19 detection.

J Am Med Inform Assoc. 2021-3-1

Cited by

[1]
Dual-model approach for accurate chest disease detection using GViT and swin transformer V2.

Sci Rep. 2025-8-28

[2]
Integrating support vector machines and deep learning features for oral cancer histopathology analysis.

Biol Methods Protoc. 2025-5-5

[3]
Machine Learning-Based Prediction of Unplanned Readmission Due to Major Adverse Cardiac Events Among Hospitalized Patients with Blood Cancers.

Cancer Control. 2025

[4]
An Integrated Deep Learning Model with EfficientNet and ResNet for Accurate Multi-Class Skin Disease Classification.

Diagnostics (Basel). 2025-2-25

[5]
CORE-MD clinical risk score for regulatory evaluation of artificial intelligence-based medical device software.

NPJ Digit Med. 2025-2-6

[6]
Hybrid transformer-based model for mammogram classification by integrating prior and current images.

Med Phys. 2025-5

[7]
Predicting the toxic side effects of drug interactions using chemical structures and protein sequences.

Sci Rep. 2024-12-28

[8]
Anterior Cruciate Ligament Tear Detection Based on T-Distribution Slice Attention Framework with Penalty Weight Loss Optimisation.

Bioengineering (Basel). 2024-8-30

[9]
Determining risk and predictors of head and neck cancer treatment-related lymphedema: A clinicopathologic and dosimetric data mining approach using interpretable machine learning and ensemble feature selection.

Clin Transl Radiat Oncol. 2024-2-28

[10]
Pathological changes or technical artefacts? The problem of the heterogenous databases in COVID-19 CXR image analysis.

Comput Methods Programs Biomed. 2023-10
