
Style-Enhanced Transformer for Image Captioning in Construction Scenes.

Authors

Song Kani, Chen Linlin, Wang Hengyou

Affiliation

School of Science, Beijing University of Civil Engineering and Architecture, Beijing 100044, China.

Publication

Entropy (Basel). 2024 Mar 1;26(3):224. doi: 10.3390/e26030224.

DOI: 10.3390/e26030224
PMID: 38539736
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10969170/
Abstract

Image captioning is important for improving the intelligence of construction projects and assisting managers in mastering construction-site activities. However, few image-captioning models currently target construction scenes, and existing methods do not perform well in complex construction scenes. According to the characteristics of construction scenes, we label a text-description dataset based on the MOCS dataset and propose a style-enhanced Transformer for image captioning in construction scenes, called SETCAP. Specifically, we extract grid features using the Swin Transformer. Then, to enhance the style information, we not only use the grid features as the initial detailed semantic features but also extract style information with a style encoder. In addition, in the decoder, we integrate the style information into the text features. The interaction between the image semantic information and the text features generates content-appropriate sentences word by word. Finally, we add a sentence-style loss to the total loss function to make the style of the generated sentences closer to that of the training set. Experimental results show that the proposed method achieves encouraging results on both the MSCOCO and MOCS datasets. In particular, SETCAP outperforms state-of-the-art methods by 4.2% CIDEr on the MOCS dataset and by 3.9% CIDEr on the MSCOCO dataset.
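The training objective described above — a word-level captioning loss plus a sentence-style term weighted into the total loss — might be sketched as follows. This is a minimal illustration, not the paper's formulation: the squared-error style distance, the `lambda_style` weight, and all function names are assumptions, since the abstract does not give the exact equations.

```python
# Hedged sketch of a combined captioning + sentence-style loss,
# in the spirit of SETCAP's total loss. All specifics are assumed.
import math

def cross_entropy(probs, target_idx):
    # Negative log-likelihood of the ground-truth word under the
    # model's predicted word distribution for one time step.
    return -math.log(probs[target_idx])

def style_loss(gen_style, ref_style):
    # Assumed form: squared-error distance between the generated
    # sentence's style vector and a training-set reference style vector.
    return sum((g - r) ** 2 for g, r in zip(gen_style, ref_style))

def total_loss(word_probs, targets, gen_style, ref_style, lambda_style=0.1):
    # Caption loss summed over decoded words, plus the weighted
    # sentence-style term added to the total objective.
    caption = sum(cross_entropy(p, t) for p, t in zip(word_probs, targets))
    return caption + lambda_style * style_loss(gen_style, ref_style)
```

With `lambda_style > 0`, sentences whose style vector drifts from the training-set reference incur an extra penalty, which is the mechanism the abstract credits for pulling generated captions toward the training set's style.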


Figures (entropy-26-00224, g001–g008):
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/942a/10969170/01155dec26d3/entropy-26-00224-g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/942a/10969170/d823d3d0bd55/entropy-26-00224-g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/942a/10969170/e9bab1fa20fb/entropy-26-00224-g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/942a/10969170/12de5b58227f/entropy-26-00224-g004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/942a/10969170/20f6209d7433/entropy-26-00224-g005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/942a/10969170/51192789b5ac/entropy-26-00224-g006.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/942a/10969170/f8726d1539a4/entropy-26-00224-g007.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/942a/10969170/0bf8c754bc5c/entropy-26-00224-g008.jpg

Similar Articles

1. Style-Enhanced Transformer for Image Captioning in Construction Scenes.
   Entropy (Basel). 2024 Mar 1;26(3):224. doi: 10.3390/e26030224.
2. Dual Global Enhanced Transformer for image captioning.
   Neural Netw. 2022 Apr;148:129-141. doi: 10.1016/j.neunet.2022.01.011. Epub 2022 Jan 21.
3. Dual Position Relationship Transformer for Image Captioning.
   Big Data. 2022 Dec;10(6):515-527. doi: 10.1089/big.2021.0262. Epub 2022 Jan 4.
4. Cross Encoder-Decoder Transformer with Global-Local Visual Extractor for Medical Image Captioning.
   Sensors (Basel). 2022 Feb 13;22(4):1429. doi: 10.3390/s22041429.
5. Novel concept-based image captioning models using LSTM and multi-encoder transformer architecture.
   Sci Rep. 2024 Sep 5;14(1):20762. doi: 10.1038/s41598-024-69664-1.
6. Attention-Guided Image Captioning through Word Information.
   Sensors (Basel). 2021 Nov 30;21(23):7982. doi: 10.3390/s21237982.
7. Thangka Image Captioning Based on Semantic Concept Prompt and Multimodal Feature Optimization.
   J Imaging. 2023 Aug 16;9(8):162. doi: 10.3390/jimaging9080162.
8. Image Captioning with End-to-end Attribute Detection and Subsequent Attributes Prediction.
   IEEE Trans Image Process. 2020 Jan 30. doi: 10.1109/TIP.2020.2969330.
9. Adaptive Semantic-Enhanced Transformer for Image Captioning.
   IEEE Trans Neural Netw Learn Syst. 2024 Feb;35(2):1785-1796. doi: 10.1109/TNNLS.2022.3185320. Epub 2024 Feb 5.
10. Auto-Encoding and Distilling Scene Graphs for Image Captioning.
    IEEE Trans Pattern Anal Mach Intell. 2022 May;44(5):2313-2327. doi: 10.1109/TPAMI.2020.3042192. Epub 2022 Apr 1.
