Beyond Size and Class Balance: Alpha as a New Dataset Quality Metric for Deep Learning

Authors

Couch Josiah, Arnaout Rima, Arnaout Ramy

Affiliations

Department of Pathology at Beth Israel Deaconess Medical Center (BIDMC), Boston, MA 02215.

Department of Medicine, the Bakar Institute for Computational Health Sciences, and the Center for Intelligent Imaging at the University of California San Francisco, San Francisco, CA 94143.

Publication information

ArXiv. 2024 Jul 31:arXiv:2407.15724v2.

PMID:39830079

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC11741458/

Abstract

In deep learning, achieving high performance on image classification tasks requires diverse training sets. However, the current best practice (maximizing dataset size and class balance) does not guarantee dataset diversity. We hypothesized that, for a given model architecture, model performance can be improved by maximizing diversity more directly. To test this hypothesis, we introduce a comprehensive framework of diversity measures from ecology that generalizes familiar quantities like Shannon entropy by accounting for similarities among images. (Size and class balance emerge as special cases.) Analyzing thousands of subsets from seven medical datasets showed that the best correlates of performance were not size or class balance but A ("big alpha"), a set of generalized entropy measures interpreted as the effective number of image-class pairs in the dataset, after accounting for image similarities. One of these measures explained 67% of the variance in balanced accuracy, vs. 54% for class balance and just 39% for size. The best pair of measures was size plus one big-alpha measure (79%), which outperformed size plus class balance (74%). Subsets with the largest big-alpha value performed up to 16% better than those with the largest size (median improvement, 8%). We propose maximizing A as a way to improve deep learning performance in medical imaging.
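
To make the idea of an "effective number after accounting for similarities" concrete, here is a minimal sketch of a similarity-sensitive diversity measure in the general form used in ecology. It is an illustration only, under stated assumptions: the function name `effective_number`, the parameter names `p`, `Z`, and `q`, and the toy inputs are hypothetical, and the abstract does not give the paper's exact definition of big alpha, so this should not be read as the authors' implementation.

```python
import math

def effective_number(p, Z=None, q=1.0):
    """Similarity-sensitive effective number of items (illustrative sketch).

    p: relative frequencies of the items, summing to 1.
    Z: similarity matrix with Z[i][j] in [0, 1] and Z[i][i] == 1.
       If None, items are treated as completely distinct (identity matrix).
    q: viewpoint parameter; larger q gives more weight to common items.
    """
    n = len(p)
    if Z is None:
        Z = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    # "Ordinariness" of item i: how much of the dataset resembles it.
    zp = [sum(Z[i][j] * p[j] for j in range(n)) for i in range(n)]
    # Items with zero frequency do not contribute.
    support = [i for i in range(n) if p[i] > 0]
    if abs(q - 1.0) < 1e-12:
        # Limit q -> 1: exponential of a similarity-aware Shannon entropy.
        return math.exp(-sum(p[i] * math.log(zp[i]) for i in support))
    total = sum(p[i] * zp[i] ** (q - 1.0) for i in support)
    return total ** (1.0 / (1.0 - q))

# Hypothetical toy inputs: four items, balanced vs. imbalanced frequencies.
balanced = [0.25, 0.25, 0.25, 0.25]
imbalanced = [0.7, 0.1, 0.1, 0.1]
all_alike = [[1.0] * 4 for _ in range(4)]  # every item identical to every other

print(effective_number(balanced, q=1.0))
print(effective_number(imbalanced, q=1.0))
print(effective_number(balanced, Z=all_alike, q=1.0))
```

With no similarity information and q = 0, the measure simply counts the items present, which is how size arises as a special case. With q = 1 it equals the exponential of Shannon entropy, so four balanced items give an effective number of about 4 and imbalance lowers it. When all items are treated as identical, the effective number collapses to about 1 regardless of how many items there are.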


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8449/11741458/9b49168ac3ac/nihpp-2407.15724v2-f0001.jpg
