Ben Ahmed Kaoutar, Goldgof Gregory M, Paul Rahul, Goldgof Dmitry B, Hall Lawrence O
Department of Computer Science and Engineering, University of South Florida, Tampa, FL 33620, USA.
Department of Laboratory Medicine, University of California, San Francisco, CA 94143, USA.
IEEE Access. 2021;9:72970-72979. doi: 10.1109/access.2021.3079716. Epub 2021 May 13.
A number of recent papers have presented experimental evidence suggesting that highly accurate deep neural network models can be built to detect COVID-19 from chest X-ray images. In this paper, we show that good generalization to unseen data sources has not been achieved. Experiments with richer data sets than have previously been used show that models attain high accuracy on seen sources but poor accuracy on unseen sources. The reason for the disparity is that a convolutional neural network, which learns its own features, can focus on differences between X-ray machines or in patient positioning within the machine, for example. Any feature that a person would clearly rule out is called a confounding feature. Some of the models were trained on COVID-19 image data taken from publications, which may differ from raw images. Some data sets consisted of pediatric pneumonia cases, whereas COVID-19 chest X-rays come almost exclusively from adults, so lung size becomes a spurious feature that can be exploited. In this work, we eliminated many confounding features by working with data as close to raw as possible. Still, deep-learned models may leverage source-specific confounders to differentiate COVID-19 from pneumonia, preventing generalization to new data sources (i.e., external sites). Our models achieved an AUC of 1.00 on seen data sources but, in the worst case, an AUC of only 0.38 on unseen ones. This indicates that such models need further assessment and development before they can be broadly deployed clinically. An example of fine-tuning to improve performance at a new site is given.