Department of Radiology, University of Michigan, Ann Arbor, MI, USA.
Med Phys. 2021 Jun;48(6):2827-2837. doi: 10.1002/mp.14678. Epub 2021 Apr 12.
Transfer learning is commonly used in deep learning for medical imaging to alleviate the problem of limited available data. In this work, we studied the risk of feature leakage and its dependence on sample size when using a pretrained deep convolutional neural network (DCNN) as a feature extractor for the classification of breast masses in mammography.
Feature leakage occurs when the training set is used for feature selection and classifier modeling while the cost function is guided by the validation performance or informed by the test performance. The high-dimensional feature space extracted from a pretrained DCNN suffers from the curse of dimensionality: feature subsets that provide excessively optimistic performance can be found for the validation set or test set if the latter is reused without limit during algorithm development. We designed a simulation study to examine feature leakage when using a DCNN as a feature extractor for mass classification in mammography. Four thousand five hundred and seventy-seven unique mass lesions were partitioned by patient into three sets: 3222 for training, 508 for validation, and 847 for independent testing. Three pretrained DCNNs, AlexNet, GoogLeNet, and VGG16, were first compared on the training set in fourfold cross validation, and one was selected as the feature extractor. To assess generalization errors, the independent test set was sequestered as truly unseen cases. Training sets ranging in size from 10% to 75% of the available training set were simulated by random drawing, in addition to the full (100%) training set. Three commonly used classifiers, the linear discriminant, the support vector machine, and the random forest, were evaluated. A sequential feature selection method was used to find feature subsets that achieved high classification performance, in terms of the area under the receiver operating characteristic curve (AUC), on the validation set. The extent of feature leakage and the impact of training set size were analyzed by comparison with the performance on the unseen test set.
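The leakage mechanism described above can be reproduced in miniature. The sketch below (an illustration under simplified assumptions, not the paper's experiment) draws pure-noise "features" standing in for a high-dimensional DCNN feature space, so the honest AUC is about 0.5. Greedy sequential forward selection guided by validation AUC, paired with a minimal linear discriminant trained on the training set, nevertheless finds a feature subset with an inflated validation AUC, while the sequestered test set reveals near-chance performance. All sizes and the 10-feature budget are arbitrary choices for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

def auc(scores, labels):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) statistic."""
    ranks = np.empty(len(scores))
    ranks[np.argsort(scores)] = np.arange(1, len(scores) + 1)
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Pure-noise features: no true class signal exists, so honest AUC ~ 0.5.
n_train, n_val, n_test, n_feat = 200, 50, 200, 300
X = rng.standard_normal((n_train + n_val + n_test, n_feat))
y = rng.integers(0, 2, n_train + n_val + n_test)
X_tr, y_tr = X[:n_train], y[:n_train]
X_va, y_va = X[n_train:n_train + n_val], y[n_train:n_train + n_val]
X_te, y_te = X[n_train + n_val:], y[n_train + n_val:]

def lda_scores(sel, X_fit, y_fit, X_eval):
    # Minimal linear discriminant: project onto the difference of class means.
    w = X_fit[y_fit == 1][:, sel].mean(0) - X_fit[y_fit == 0][:, sel].mean(0)
    return X_eval[:, sel] @ w

# Sequential forward selection GUIDED BY VALIDATION AUC -> feature leakage:
# the validation set is reused at every step of feature selection.
selected = []
for _ in range(10):
    cands = [j for j in range(n_feat) if j not in selected]
    val_aucs = [auc(lda_scores(selected + [j], X_tr, y_tr, X_va), y_va)
                for j in cands]
    selected.append(cands[int(np.argmax(val_aucs))])

val_auc = auc(lda_scores(selected, X_tr, y_tr, X_va), y_va)
test_auc = auc(lda_scores(selected, X_tr, y_tr, X_te), y_te)
print(f"validation AUC (leaked): {val_auc:.2f}, sequestered test AUC: {test_auc:.2f}")
```

Running this shows the optimistic bias directly: the repeatedly reused validation set rewards feature subsets that fit its noise, while the sequestered test set, touched only once, stays near chance level.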
All three classifiers showed large generalization errors between the validation set and the independent sequestered test set at all sample sizes. The generalization error decreased as the sample size increased. With 100% of the training set, one classifier achieved an AUC as high as 0.91 on the validation set, while the corresponding performance on the unseen test set only reached an AUC of 0.72.
Our results demonstrate that large generalization errors can occur in AI tools due to feature leakage. Without evaluation on unseen test cases, optimistically biased performance may be reported inadvertently, which can lead to unrealistic expectations and reduce confidence in clinical implementation.