Tsourounis Dimitrios, Kastaniotis Dimitris, Theoharatos Christos, Kazantzidis Andreas, Economou George
Department of Physics, University of Patras, 26504 Rio Patra, Greece.
IRIDA Labs S.A., Patras InnoHub, Kastritsiou 4, 26504 Rio Patra, Greece.
J Imaging. 2022 Sep 21;8(10):256. doi: 10.3390/jimaging8100256.
Despite the success of hand-crafted features in computer vision for many years, they have nowadays been replaced by end-to-end learnable features extracted from deep convolutional neural networks (CNNs). While CNNs can learn robust features directly from image pixels, they require large numbers of samples and heavy data augmentation. In contrast, hand-crafted features such as SIFT exhibit several interesting properties, including local rotation invariance. In this work, a novel scheme combining the strengths of SIFT descriptors with CNNs, namely SIFT-CNN, is presented. Given a single-channel image, one SIFT descriptor is computed for every pixel, so that every pixel is represented as an M-dimensional histogram, which ultimately results in an M-channel image. The SIFT image is thus generated from the SIFT descriptors of all the pixels of the single-channel image while preserving the original spatial size. Next, a CNN is trained to use these M-channel images as inputs, operating directly on the multiscale SIFT images with regular convolutions. Since these images incorporate spatial relations between the histograms of the SIFT descriptors, the CNN is guided to learn features from local gradient information that would otherwise be neglected. In this manner, the SIFT-CNN implicitly acquires a local rotation invariance property, which is desirable for problems where local areas of the image can be rotated without affecting the overall classification result. Such problems include indirect immunofluorescence (IIF) cell image classification, cloud classification from ground-based all-sky images, and human lip-reading classification. The results on popular datasets for these three problems indicate that, owing to its robustness to local rotations, the proposed SIFT-CNN improves performance and surpasses the corresponding CNNs trained directly on pixel values across various challenging tasks. Our findings highlight the importance of the input image representation in the overall efficiency of a data-driven system.
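To make the pipeline concrete, the following is a minimal sketch (not the authors' code) of the SIFT-image construction and its use as CNN input, assuming opencv-python and PyTorch are available; the patch size, toy image, and small network head are illustrative assumptions, and only a single descriptor scale is shown rather than the multiscale version described above.

```python
# Minimal sketch: per-pixel SIFT descriptors form an M-channel "SIFT image"
# (M = 128 for standard SIFT), which a regular CNN then consumes directly.
import cv2
import numpy as np
import torch
import torch.nn as nn

def sift_image(gray: np.ndarray, patch_size: float = 8.0) -> np.ndarray:
    """Return an (H, W, 128) array of per-pixel SIFT descriptors."""
    h, w = gray.shape
    sift = cv2.SIFT_create()
    # Place one keypoint at every pixel; 'patch_size' is an assumed local scale.
    keypoints = [cv2.KeyPoint(float(x), float(y), patch_size)
                 for y in range(h) for x in range(w)]
    _, desc = sift.compute(gray, keypoints)           # (H*W, 128)
    assert desc.shape[0] == h * w                     # all keypoints kept
    return desc.reshape(h, w, 128).astype(np.float32)

# Toy single-channel image; spatial size is preserved in the SIFT image.
gray = np.random.randint(0, 256, (64, 64), dtype=np.uint8)
x = torch.from_numpy(sift_image(gray)).permute(2, 0, 1).unsqueeze(0)  # (1, 128, 64, 64)

# Hypothetical small CNN whose first convolution takes 128 input channels.
cnn = nn.Sequential(
    nn.Conv2d(128, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(32, 6),        # e.g., 6 output classes (placeholder head)
)
logits = cnn(x)
print(logits.shape)          # torch.Size([1, 6])
```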