
Style-Enhanced Transformer for Image Captioning in Construction Scenes.

Authors

Song Kani, Chen Linlin, Wang Hengyou

Affiliation

School of Science, Beijing University of Civil Engineering and Architecture, Beijing 100044, China.

Publication

Entropy (Basel). 2024 Mar 1;26(3):224. doi: 10.3390/e26030224.

DOI: 10.3390/e26030224
PMID: 38539736
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10969170/
Abstract

Image captioning is important for improving the intelligence of construction projects and assisting managers in mastering construction-site activities. However, few image-captioning models currently target construction scenes, and existing methods do not perform well in complex construction scenes. According to the characteristics of construction scenes, we label a text-description dataset based on the MOCS dataset and propose a style-enhanced Transformer for image captioning in construction scenes, called SETCAP. Specifically, we extract grid features using the Swin Transformer. Then, to enhance the style information, we not only use the grid features as the initial detailed semantic features but also extract style information with a style encoder. In addition, in the decoder, we integrate the style information into the text features. The interaction between the image semantic information and the text features generates content-appropriate sentences word by word. Finally, we add a sentence-style loss to the total loss function to make the style of the generated sentences closer to that of the training set. Experimental results show that the proposed method achieves encouraging results on both the MSCOCO and MOCS datasets. In particular, SETCAP outperforms state-of-the-art methods by 4.2% CIDEr on the MOCS dataset and by 3.9% CIDEr on the MSCOCO dataset.
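The training objective described above — a word-level captioning loss plus a sentence-style term weighted into the total loss — might be sketched as follows. This is a minimal illustration, not the paper's formulation: the squared-error style distance, the `lambda_style` weight, and all function names are assumptions, since the abstract does not give the exact equations.

```python
# Hedged sketch of a combined captioning + sentence-style loss,
# in the spirit of SETCAP's total loss. All specifics are assumed.
import math

def cross_entropy(probs, target_idx):
    # Negative log-likelihood of the ground-truth word under the
    # model's predicted word distribution for one time step.
    return -math.log(probs[target_idx])

def style_loss(gen_style, ref_style):
    # Assumed form: squared-error distance between the generated
    # sentence's style vector and a training-set reference style vector.
    return sum((g - r) ** 2 for g, r in zip(gen_style, ref_style))

def total_loss(word_probs, targets, gen_style, ref_style, lambda_style=0.1):
    # Caption loss summed over decoded words, plus the weighted
    # sentence-style term added to the total objective.
    caption = sum(cross_entropy(p, t) for p, t in zip(word_probs, targets))
    return caption + lambda_style * style_loss(gen_style, ref_style)
```

With `lambda_style > 0`, sentences whose style vector drifts from the training-set reference incur an extra penalty, which is the mechanism the abstract credits for pulling generated captions toward the training set's style.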


Figures (entropy-26-00224, g001–g008):
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/942a/10969170/01155dec26d3/entropy-26-00224-g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/942a/10969170/d823d3d0bd55/entropy-26-00224-g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/942a/10969170/e9bab1fa20fb/entropy-26-00224-g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/942a/10969170/12de5b58227f/entropy-26-00224-g004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/942a/10969170/20f6209d7433/entropy-26-00224-g005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/942a/10969170/51192789b5ac/entropy-26-00224-g006.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/942a/10969170/f8726d1539a4/entropy-26-00224-g007.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/942a/10969170/0bf8c754bc5c/entropy-26-00224-g008.jpg

Similar Articles

1. Style-Enhanced Transformer for Image Captioning in Construction Scenes.
   Entropy (Basel). 2024 Mar 1;26(3):224. doi: 10.3390/e26030224.
2. Dual Global Enhanced Transformer for image captioning.
   Neural Netw. 2022 Apr;148:129-141. doi: 10.1016/j.neunet.2022.01.011. Epub 2022 Jan 21.
3. Dual Position Relationship Transformer for Image Captioning.
   Big Data. 2022 Dec;10(6):515-527. doi: 10.1089/big.2021.0262. Epub 2022 Jan 4.
4. Cross Encoder-Decoder Transformer with Global-Local Visual Extractor for Medical Image Captioning.
   Sensors (Basel). 2022 Feb 13;22(4):1429. doi: 10.3390/s22041429.
5. Novel concept-based image captioning models using LSTM and multi-encoder transformer architecture.
   Sci Rep. 2024 Sep 5;14(1):20762. doi: 10.1038/s41598-024-69664-1.
6. Attention-Guided Image Captioning through Word Information.
   Sensors (Basel). 2021 Nov 30;21(23):7982. doi: 10.3390/s21237982.
7. Thangka Image Captioning Based on Semantic Concept Prompt and Multimodal Feature Optimization.
   J Imaging. 2023 Aug 16;9(8):162. doi: 10.3390/jimaging9080162.
8. Image Captioning with End-to-end Attribute Detection and Subsequent Attributes Prediction.
   IEEE Trans Image Process. 2020 Jan 30. doi: 10.1109/TIP.2020.2969330.
9. Adaptive Semantic-Enhanced Transformer for Image Captioning.
   IEEE Trans Neural Netw Learn Syst. 2024 Feb;35(2):1785-1796. doi: 10.1109/TNNLS.2022.3185320. Epub 2024 Feb 5.
10. Auto-Encoding and Distilling Scene Graphs for Image Captioning.
    IEEE Trans Pattern Anal Mach Intell. 2022 May;44(5):2313-2327. doi: 10.1109/TPAMI.2020.3042192. Epub 2022 Apr 1.
