IEEE J Biomed Health Inform. 2022 Jan;26(1):172-182. doi: 10.1109/JBHI.2021.3119325. Epub 2022 Jan 17.
As of March 31, 2021, coronavirus disease 2019 (COVID-19) had reportedly infected more than 127 million people and caused over 2.5 million deaths worldwide. Timely diagnosis of COVID-19 is crucial both for managing individual patients and for containing this highly contagious disease. Given the clinical value of non-contrast chest computed tomography (CT) for diagnosing COVID-19, deep learning (DL) based automated methods have been proposed to help radiologists read the large volume of CT exams generated by the pandemic. In this work, we address an overlooked problem in training deep convolutional neural networks for COVID-19 classification on real-world multi-source data, namely the data source bias problem. Data source bias arises when certain data sources contain only a single class of data; training on such source-biased data may lead DL models to learn to distinguish data sources rather than COVID-19. To overcome this problem, we propose MIx-aNd-Interpolate (MINI), a conceptually simple, easy-to-implement, efficient, and effective training strategy. MINI generates volumes of the absent class by combining samples collected from different hospitals, which enlarges the sample space of the original source-biased dataset. Experimental results on a large collection of real patient data (1,221 COVID-19 and 1,520 negative CT images, the latter consisting of 786 community-acquired pneumonia and 734 non-pneumonia cases) from eight hospitals and health institutions show that: 1) MINI improves COVID-19 classification performance over the baseline (which does not deal with the source bias), and 2) MINI is superior to competing methods in terms of the extent of improvement.
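The data source bias condition described above (a source contributing only one class) can be checked mechanically before training. The following is a minimal sketch, not the authors' code and not the MINI procedure itself: the function name `find_single_class_sources`, the `(source, label)` record layout, and the hospital and class names are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def find_single_class_sources(records, all_classes):
    """Map each single-class source to the set of classes it lacks.

    records: iterable of (source, label) pairs (hypothetical layout).
    all_classes: every class label expected in the dataset.
    """
    labels_by_source = defaultdict(set)
    for source, label in records:
        labels_by_source[source].add(label)

    expected = set(all_classes)
    # A source is flagged when every one of its samples shares one label.
    return {
        source: expected - labels
        for source, labels in labels_by_source.items()
        if len(labels) == 1
    }


if __name__ == "__main__":
    # Toy records; hospital names and labels are made up.
    demo = [
        ("hospital_A", "covid"),
        ("hospital_A", "negative"),
        ("hospital_B", "covid"),
        ("hospital_C", "negative"),
    ]
    for source, absent in find_single_class_sources(demo, {"covid", "negative"}).items():
        print(source, "lacks", sorted(absent))
```

Sources flagged this way are the ones for which a classifier could trivially predict the label from the source alone; these are the cases the abstract says MINI targets by generating volumes of the absent class.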