Inconsistent Performance of Deep Learning Models on Mammogram Classification.

Authors

Wang Xiaoqin, Liang Gongbo, Zhang Yu, Blanton Hunter, Bessinger Zachary, Jacobs Nathan

Affiliations

Department of Radiology, University of Kentucky, Lexington, Kentucky; Markey Cancer Center, University of Kentucky, Lexington, Kentucky.

Department of Computer Science, University of Kentucky, Lexington, Kentucky.

Publication information

J Am Coll Radiol. 2020 Jun;17(6):796-803. doi: 10.1016/j.jacr.2020.01.006. Epub 2020 Feb 14.

Abstract

OBJECTIVES

Performance of recently developed deep learning models for image classification surpasses that of radiologists. However, there are questions about model performance consistency and generalization in unseen external data. The purpose of this study is to determine whether the high performance of deep learning on mammograms can be transferred to external data with a different data distribution.

MATERIALS AND METHODS

Six deep learning models (three published models with high performance and three models designed by us) were evaluated on four different mammogram data sets, including three public (Digital Database for Screening Mammography, INbreast, and Mammographic Image Analysis Society) and one private data set (UKy). The models were trained and validated on either Digital Database for Screening Mammography alone or a combined data set that included Digital Database for Screening Mammography. The models were then tested on the three external data sets. The area under the receiver operating characteristic curve (auROC) was used to evaluate model performance.
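The abstract does not say how the auROC values or their confidence intervals were computed. As a minimal sketch only, the metric can be illustrated with the rank-based identity (the fraction of positive/negative pairs in which the positive case receives the higher score) and, as an assumption, a percentile bootstrap for the interval. The function names `auroc` and `bootstrap_ci` and the toy labels and scores are hypothetical and are not taken from the study.

```python
import random


def auroc(labels, scores):
    """auROC via the pairwise (Mann-Whitney) identity; labels must be 0 or 1."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    # Count pairs where the positive case outscores the negative; ties count half.
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def bootstrap_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for auROC (an assumed method, not the paper's)."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that contain only one class
            stats.append(auroc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Toy example with made-up values: three of the four pairs are ranked correctly.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(auroc(toy_labels, toy_scores))  # 0.75
```

An auROC of 0.5 corresponds to chance-level ranking, which is the reference point for reading the external-test results reported in this abstract.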

RESULTS

The three published models reported auROC scores between 0.88 and 0.95 on the validation data set. Our models achieved auROC scores between 0.71 (95% confidence interval [CI]: 0.70-0.72) and 0.79 (95% CI: 0.78-0.80) on the same validation data set. However, the auROC scores of all six models on the three external test data sets were significantly lower, ranging from 0.44 (95% CI: 0.43-0.45) to 0.65 (95% CI: 0.64-0.66).

CONCLUSION

Our results demonstrate performance inconsistency across the data sets and models, indicating that the high performance of deep learning models on one data set cannot be readily transferred to unseen external data sets, and these models need further assessment and validation before being applied in clinical practice.
