Couch Josiah, Arnaout Rima, Arnaout Ramy
Department of Pathology at Beth Israel Deaconess Medical Center (BIDMC), Boston, MA 02215.
Department of Medicine, the Bakar Institute for Computational Health Sciences, and the Center for Intelligent Imaging at the University of California San Francisco, San Francisco, CA 94143.
ArXiv. 2024 Jul 31:arXiv:2407.15724v2.
In deep learning, achieving high performance on image classification tasks requires diverse training sets. However, the current best practice (maximizing dataset size and class balance) does not guarantee dataset diversity. We hypothesized that, for a given model architecture, model performance can be improved by maximizing diversity more directly. To test this hypothesis, we introduce a comprehensive framework of diversity measures from ecology that generalizes familiar quantities like Shannon entropy by accounting for similarities among images. (Size and class balance emerge as special cases.) Analyzing thousands of subsets from seven medical datasets showed that the best correlates of performance were not size or class balance but "big alpha," a set of generalized entropy measures interpreted as the effective number of image-class pairs in the dataset, after accounting for image similarities. One of these measures explained 67% of the variance in balanced accuracy, vs. 54% for class balance and just 39% for size. The best pair of measures was size plus a big-alpha measure (79%), which outperformed size plus class balance (74%). Subsets with the largest value of a big-alpha measure performed up to 16% better than those with the largest size (median improvement, 8%). We propose maximizing big alpha as a way to improve deep learning performance in medical imaging.
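As a rough illustration of what a similarity-aware generalization of Shannon entropy can look like, the sketch below computes a similarity-sensitive effective number (a Hill number of order q) in the style of the Leinster-Cobbold framework from ecology. It is an assumption that this family matches the measures the abstract refers to. The function name, the toy weights, and the toy similarity matrices are hypothetical, and this is not the authors' implementation of big alpha, which the abstract defines over image-class pairs.

```python
import numpy as np


def similarity_sensitive_diversity(p, Z, q):
    """Effective number of distinct items of order q (hypothetical helper).

    p : relative weights of the items (normalized here to sum to 1)
    Z : similarity matrix, Z[i, j] in [0, 1] with Z[i, i] == 1
    q : order of the measure; q = 1 is the Shannon-type case
    """
    p = np.asarray(p, dtype=float)
    Z = np.asarray(Z, dtype=float)
    p = p / p.sum()

    # "Ordinariness" of each item: its weight plus the weight of similar items.
    Zp = Z @ p

    # Items with zero weight contribute nothing; dropping them avoids log(0).
    keep = p > 0
    p, Zp = p[keep], Zp[keep]

    if np.isclose(q, 1.0):
        # Limit q -> 1: exponential of a similarity-aware Shannon entropy.
        return float(np.exp(-np.sum(p * np.log(Zp))))
    return float(np.sum(p * Zp ** (q - 1.0)) ** (1.0 / (1.0 - q)))


# Toy example: three items with unequal weights.
weights = [0.5, 0.25, 0.25]
no_similarity = np.eye(3)          # every item is completely distinct
all_identical = np.ones((3, 3))    # every item is identical to every other

print(similarity_sensitive_diversity(weights, no_similarity, q=0))
print(similarity_sensitive_diversity(weights, no_similarity, q=1))
print(similarity_sensitive_diversity(weights, all_identical, q=1))
```

With the identity matrix (no similarity), order 0 reduces to a plain count of items and order 1 reduces to the exponential of Shannon entropy, so it depends on how evenly the weights are spread. With a matrix of all ones (all items identical), the effective number collapses to 1 regardless of how many items there are. This is one way to see how quantities like size can appear as special cases of a similarity-aware measure.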