Reina G Anthony, Panchumarthy Ravi, Thakur Siddhesh Pravin, Bastidas Alexei, Bakas Spyridon
Intel Corporation, Santa Clara, CA, United States.
Center for Biomedical Image Computing and Analytics, University of Pennsylvania, Philadelphia, PA, United States.
Front Neurosci. 2020 Feb 7;14:65. doi: 10.3389/fnins.2020.00065. eCollection 2020.
Convolutional neural network (CNN) models obtain state-of-the-art performance on image classification, localization, and segmentation tasks. Limitations in computer hardware, most notably memory size in deep learning accelerator cards, prevent relatively large images, such as those from medical and satellite imaging, from being processed as a whole in their original resolution. A fully convolutional topology, such as U-Net, is typically trained on down-sampled images and inferred on images of their original size and resolution, by simply dividing the larger image into smaller (typically overlapping) tiles, making predictions on these tiles, and stitching them back together as the prediction for the whole image. In this study, we show that this tiling technique, combined with the fact that CNNs are not truly translationally invariant, causes small but relevant differences during inference that can be detrimental to the performance of the model. Here we quantify these variations in both medical (i.e., BraTS) and non-medical (i.e., satellite) images and show that training a 2D U-Net model on the whole image substantially improves the overall model performance. Finally, we compare 2D and 3D semantic segmentation models to show that providing CNN models with a wider context of the image in all three dimensions leads to more accurate and consistent predictions. Our results suggest that tiling the input to CNN models, while perhaps necessary to overcome the memory limitations in computer hardware, may lead to undesirable and unpredictable errors in the model's output that can only be adequately mitigated by increasing the input of the model to the largest possible tile size.
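The overlapping-tile inference pipeline described above (divide, predict per tile, stitch) can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation; the function names `tile_image` and `stitch` and the average-in-overlap stitching rule are assumptions for demonstration purposes.

```python
import numpy as np

def tile_image(img, tile, stride):
    """Split a 2D image into overlapping square tiles of side `tile`,
    stepping by `stride` (stride < tile produces overlap)."""
    h, w = img.shape
    tiles, coords = [], []
    for y in range(0, h - tile + 1, stride):
        for x in range(0, w - tile + 1, stride):
            tiles.append(img[y:y + tile, x:x + tile])
            coords.append((y, x))
    return tiles, coords

def stitch(pred_tiles, coords, shape, tile):
    """Reassemble per-tile predictions into a full-size map,
    averaging wherever tiles overlap."""
    acc = np.zeros(shape, dtype=float)   # summed predictions
    cnt = np.zeros(shape, dtype=float)   # how many tiles covered each pixel
    for p, (y, x) in zip(pred_tiles, coords):
        acc[y:y + tile, x:x + tile] += p
        cnt[y:y + tile, x:x + tile] += 1
    return acc / cnt

# Example: an 8x8 image split into 4x4 tiles with stride 2 (50% overlap).
# With an identity "model" (prediction == tile), stitching recovers the image.
img = np.arange(64, dtype=float).reshape(8, 8)
tiles, coords = tile_image(img, tile=4, stride=2)
out = stitch(tiles, coords, img.shape, tile=4)
```

If the model were perfectly translationally invariant, the stitched prediction would be identical regardless of tile placement; the paper's point is that real CNN predictions differ near tile borders, so the averaged overlaps only partially hide the discrepancy.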