Calderon-Ramirez Saul, Yang Shengxiang, Moemeni Armaghan, Elizondo David, Colreavy-Donnelly Simon, Chavarría-Estrada Luis Fernando, Molina-Cabello Miguel A
Centre for Computational Intelligence (CCI), De Montfort University, United Kingdom.
Instituto Tecnologico de Costa Rica, Costa Rica.
Appl Soft Comput. 2021 Nov;111:107692. doi: 10.1016/j.asoc.2021.107692. Epub 2021 Jul 13.
A key factor in the fight against viral diseases such as the coronavirus disease (COVID-19) is identifying virus carriers as early and quickly as possible, in a cheap and efficient manner. Applying deep learning to the classification of chest X-ray images of COVID-19 patients could become a useful pre-diagnostic detection methodology. However, deep learning architectures require large labelled datasets. This is often a limitation when the subject of research is relatively new, as in the case of a virus outbreak, where only small labelled datasets are available. Moreover, in such contexts the datasets are also highly imbalanced, with few observations from positive cases of the new disease. In this work we evaluate the performance of the semi-supervised deep learning architecture known as MixMatch with a very limited number of labelled observations and highly imbalanced labelled datasets. We demonstrate the critical impact of data imbalance on the model's accuracy. We therefore propose a simple approach for correcting data imbalance: re-weighting each observation in the loss function, giving a higher weight to observations from the under-represented class. For unlabelled observations, we use the pseudo and augmented labels calculated by MixMatch to choose the appropriate weight. The proposed method improved classification accuracy by up to 18% with respect to the non-balanced MixMatch algorithm. We tested the proposed approach on several available datasets using 10, 15 and 20 labelled observations for binary classification (COVID-19 positive and normal cases). For multi-class classification (COVID-19 positive, pneumonia and normal cases), we tested 30, 50, 70 and 90 labelled observations. Additionally, a new dataset, composed of chest X-ray images of Costa Rican adult patients, is included among the tested datasets.
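The re-weighting idea described above can be illustrated with a minimal sketch. The abstract does not give the exact weighting formula, so the normalised inverse-frequency weights below are an assumption, as are all function names (`inverse_frequency_weights`, `weighted_supervised_loss`, `weighted_unsupervised_loss`). The supervised term is a weighted cross-entropy; the unsupervised term follows the squared-error form used by MixMatch, with the weight selected by the most probable class of each pseudo-label.

```python
import math

def inverse_frequency_weights(labels, num_classes):
    """Assumed scheme: weight each class by its inverse frequency, normalised to sum to 1."""
    counts = [0] * num_classes
    for y in labels:
        counts[y] += 1
    inverse = [1.0 / c if c > 0 else 0.0 for c in counts]
    total = sum(inverse)
    return [v / total for v in inverse]

def weighted_supervised_loss(probs, labels, class_weights):
    """Mean cross-entropy where each observation is scaled by the weight of its true class."""
    losses = [
        -class_weights[y] * math.log(max(p[y], 1e-12))
        for p, y in zip(probs, labels)
    ]
    return sum(losses) / len(losses)

def weighted_unsupervised_loss(probs, pseudo_labels, class_weights):
    """Mean squared error against pseudo-labels, weighted by the pseudo-label's top class."""
    total = 0.0
    for p, q in zip(probs, pseudo_labels):
        top_class = max(range(len(q)), key=q.__getitem__)  # class used to pick the weight
        squared_error = sum((pi - qi) ** 2 for pi, qi in zip(p, q))
        total += class_weights[top_class] * squared_error
    return total / len(probs)

# Hypothetical labelled set: three normal cases (class 0), one COVID-19 positive (class 1).
weights = inverse_frequency_weights([0, 0, 0, 1], num_classes=2)
sup = weighted_supervised_loss([[0.5, 0.5]], [1], weights)
unsup = weighted_unsupervised_loss([[0.5, 0.5]], [[0.0, 1.0]], weights)
```

With this hypothetical 3:1 labelled set, the under-represented positive class receives three times the weight of the normal class, so errors on positive (or pseudo-positive) observations contribute more to both loss terms.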