Abdal Hafeth Deema, Kollias Stefanos
School of Computer Science, University of Lincoln, Lincoln LN6 7TS, UK.
School of Electrical & Computer Engineering, National Technical University of Athens, 15780 Athens, Greece.
Sensors (Basel). 2024 Mar 11;24(6):1796. doi: 10.3390/s24061796.
Image captioning is a technique used to generate descriptive captions for images. Typically, it employs a Convolutional Neural Network (CNN) as the encoder to extract visual features, and a decoder model, often based on Recurrent Neural Networks (RNNs), to generate the captions. Recently, the self-attention mechanism has been widely adopted within the encoder-decoder architecture. However, this approach faces certain challenges that require further research. One such challenge is that the extracted visual features do not fully exploit the available image information, primarily due to the absence of semantic concepts. This limits the ability to fully comprehend the content depicted in the image. To address this issue, we present a new image-Transformer-based model enhanced with image object semantic representation. Our model incorporates semantic representation in encoder attention, enriching visual features by integrating instance-level concepts. Additionally, we employ a Transformer as the decoder in the language generation module. By doing so, we achieve improved performance in generating accurate and diverse captions. We evaluated the performance of our model on the MS-COCO and novel MACE datasets. The results show that our model is competitive with state-of-the-art approaches in caption generation.
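The sketch below is a minimal, hypothetical illustration of the kind of architecture the abstract describes: a Transformer encoder that attends jointly over CNN region features and embeddings of detected object classes (instance-level semantic concepts), followed by a Transformer decoder that generates the caption. It is not the authors' implementation; all class names, parameter names, and dimensions (e.g. SemanticImageTransformerCaptioner, feat_dim, num_classes) are assumptions for illustration only.

```python
# Hypothetical sketch (not the paper's code): semantic-augmented Transformer captioner.
import torch
import torch.nn as nn

class SemanticImageTransformerCaptioner(nn.Module):
    def __init__(self, feat_dim=2048, d_model=512, vocab_size=10000,
                 num_classes=80, n_heads=8, n_layers=3, max_len=20):
        super().__init__()
        self.visual_proj = nn.Linear(feat_dim, d_model)          # project CNN region features
        self.semantic_emb = nn.Embedding(num_classes, d_model)   # embed detected object class IDs
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, n_layers)
        self.word_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        dec_layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, region_feats, object_ids, caption_ids):
        # region_feats: (B, R, feat_dim) visual features from a CNN / object detector
        # object_ids:   (B, O) class indices of detected instances (semantic concepts)
        # caption_ids:  (B, T) caption tokens for teacher-forced training
        vis = self.visual_proj(region_feats)
        sem = self.semantic_emb(object_ids)
        # Encoder self-attention runs over the concatenated visual + semantic tokens,
        # so visual features are enriched with instance-level concepts.
        memory = self.encoder(torch.cat([vis, sem], dim=1))
        T = caption_ids.size(1)
        pos = torch.arange(T, device=caption_ids.device)
        tgt = self.word_emb(caption_ids) + self.pos_emb(pos)
        mask = nn.Transformer.generate_square_subsequent_mask(T).to(caption_ids.device)
        out = self.decoder(tgt, memory, tgt_mask=mask)           # Transformer language decoder
        return self.lm_head(out)                                 # (B, T, vocab_size) logits
```

In this reading, the key design choice is that semantic concepts enter as extra encoder tokens rather than being fused after decoding, so every decoder cross-attention step can weigh both appearance features and object identities.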