Department of Computer Science and Engineering, Dongguk University, Seoul 04620, Korea.
Department of Artificial Intelligence, Dongguk University, Seoul 04620, Korea.
Sensors (Basel). 2022 Feb 13;22(4):1429. doi: 10.3390/s22041429.
Transformer-based approaches have shown good results on image captioning tasks. However, current approaches are limited to generating text from the global features of an entire image. We therefore propose two novel methods for generating better image captions: (1) the Global-Local Visual Extractor (GLVE), which captures both global and local features, and (2) the Cross Encoder-Decoder Transformer (CEDT), which injects multiple levels of encoder features into the decoding process. GLVE extracts not only global visual features obtainable from the entire image, such as organ size or bone structure, but also local visual features derived from a local region, such as a lesion area. Given an image, CEDT can produce a detailed description of its overall features by injecting both low-level and high-level encoder outputs into the decoder. Each method contributes to the performance improvement and to generating descriptions of features such as organ size and bone structure. The proposed model was evaluated on the IU X-ray dataset and outperformed the transformer-based baseline by 5.6% in BLEU, 0.56% in METEOR, and 1.98% in ROUGE-L.
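The CEDT idea of injecting multiple encoder levels into the decoder can be illustrated with a minimal sketch. This is not the authors' implementation; it assumes, for illustration, that decoder queries cross-attend over the concatenation of a low-level and a high-level set of encoder outputs, versus a baseline that attends only to the final encoder layer. All names, shapes, and the single-head attention are hypothetical simplifications.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values):
    # Simplified single-head scaled dot-product attention
    # (keys and values share one matrix for brevity).
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)
    return softmax(scores) @ keys_values

rng = np.random.default_rng(0)
d_model = 16
# Hypothetical encoder outputs over 10 visual tokens:
low_level = rng.standard_normal((10, d_model))   # early encoder layer
high_level = rng.standard_normal((10, d_model))  # final encoder layer

# 5 partially decoded caption tokens acting as queries.
decoder_queries = rng.standard_normal((5, d_model))

# Baseline transformer: decoder attends only to the final encoder layer.
baseline_ctx = cross_attention(decoder_queries, high_level)

# CEDT-style injection: attend over both levels at once.
multi_level = np.concatenate([low_level, high_level], axis=0)
cedt_ctx = cross_attention(decoder_queries, multi_level)

print(baseline_ctx.shape)  # (5, 16)
print(cedt_ctx.shape)      # (5, 16)
```

The output context vectors have the same shape in both cases; the difference is that the CEDT-style context can draw on fine-grained early-layer features (e.g., local lesion detail) as well as abstract final-layer features.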