
Image Captioning with End-to-end Attribute Detection and Subsequent Attributes Prediction.

Authors

Huang Yiqing, Chen Jiansheng, Ouyang Wanli, Wan Weitao, Xue Youze

Publication

IEEE Trans Image Process. 2020 Jan 30. doi: 10.1109/TIP.2020.2969330.

DOI: 10.1109/TIP.2020.2969330
PMID: 32012014
Abstract

Semantic attention has been shown to be effective in improving the performance of image captioning. The core of semantic attention based methods is to drive the model to attend to semantically important words, or attributes. In previous works, the attribute detector and the captioning network are usually independent, leading to insufficient use of the semantic information. Moreover, all detected attributes are attended to throughout the caption generation process, whether or not they suit the linguistic context at the current step, which can mislead the captioning model into attending to incorrect visual concepts. To solve these problems, we introduce two end-to-end trainable modules that closely couple attribute detection with image captioning and promote the effective use of attributes by predicting appropriate attributes at each time step. The multimodal attribute detector (MAD) module improves attribute detection accuracy by using not only the image features but also the word embeddings of attributes, which already exist in most captioning models. MAD models the similarity between the semantics of attributes and the image object features to facilitate accurate detection. The subsequent attribute predictor (SAP) module dynamically predicts a concise attribute subset at each time step to mitigate the diversity of image attributes. Compared to previous attribute-based methods, our approach enhances the explainability of how attributes affect the generated words and achieves a state-of-the-art single-model performance of 128.8 CIDEr-D on the MSCOCO dataset. Extensive experiments on MSCOCO show that our approach improves performance in both image captioning and attribute detection simultaneously. The code is available at: https://github.com/RubickH/Image-Captioning-with-MAD-and-SAP.
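For illustration only, the two modules described in the abstract can be sketched at a high level. The following is a minimal NumPy sketch, not the authors' implementation (see their repository for that): the function names, array shapes, and the cosine-similarity scoring are assumptions chosen to illustrate the idea that MAD scores attributes by the similarity between attribute word embeddings and detected image object features, while SAP keeps a concise, context-weighted top-k attribute subset at each decoding step.

```python
import numpy as np

def mad_attribute_scores(object_feats, attr_embeds):
    """MAD-style sketch: score each attribute by its maximum cosine
    similarity to any image object feature, then squash the score to a
    detection probability with a sigmoid.

    object_feats: (num_objects, d) image object features
    attr_embeds:  (num_attrs, d) attribute word embeddings
    returns:      (num_attrs,) attribute probabilities in (0, 1)
    """
    obj = object_feats / np.linalg.norm(object_feats, axis=1, keepdims=True)
    attr = attr_embeds / np.linalg.norm(attr_embeds, axis=1, keepdims=True)
    sim = attr @ obj.T                     # (num_attrs, num_objects)
    return 1.0 / (1.0 + np.exp(-sim.max(axis=1)))

def sap_select(attr_probs, context_weights, k=2):
    """SAP-style sketch: re-weight the detected attribute probabilities by
    the current linguistic context and keep only a concise top-k subset
    for this time step, instead of attending to all attributes."""
    scores = attr_probs * context_weights
    return np.argsort(scores)[::-1][:k]    # indices of the k best attributes
```

In the actual model both modules are trained end to end with the captioning network; this sketch only shows the data flow: image features and attribute embeddings go into detection, and the detected probabilities are re-ranked per step before attention.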

Similar Articles

1. Image Captioning with End-to-end Attribute Detection and Subsequent Attributes Prediction.
   IEEE Trans Image Process. 2020 Jan 30. doi: 10.1109/TIP.2020.2969330.
2. Image Captioning Using Motion-CNN with Object Detection.
   Sensors (Basel). 2021 Feb 10;21(4):1270. doi: 10.3390/s21041270.
3. Caps Captioning: A Modern Image Captioning Approach Based on Improved Capsule Network.
   Sensors (Basel). 2022 Nov 1;22(21):8376. doi: 10.3390/s22218376.
4. More is Better: Precise and Detailed Image Captioning Using Online Positive Recall and Missing Concepts Mining.
   IEEE Trans Image Process. 2019 Jan;28(1):32-44. doi: 10.1109/TIP.2018.2855415. Epub 2018 Jul 12.
5. Thangka Image Captioning Based on Semantic Concept Prompt and Multimodal Feature Optimization.
   J Imaging. 2023 Aug 16;9(8):162. doi: 10.3390/jimaging9080162.
6. Adaptive Semantic-Enhanced Transformer for Image Captioning.
   IEEE Trans Neural Netw Learn Syst. 2024 Feb;35(2):1785-1796. doi: 10.1109/TNNLS.2022.3185320. Epub 2024 Feb 5.
7. Extracting Effective Image Attributes with Refined Universal Detection.
   Sensors (Basel). 2020 Dec 25;21(1):95. doi: 10.3390/s21010095.
8. Attention-Guided Image Captioning through Word Information.
   Sensors (Basel). 2021 Nov 30;21(23):7982. doi: 10.3390/s21237982.
9. Style-Enhanced Transformer for Image Captioning in Construction Scenes.
   Entropy (Basel). 2024 Mar 1;26(3):224. doi: 10.3390/e26030224.
10. Context-Aware Visual Policy Network for Fine-Grained Image Captioning.
    IEEE Trans Pattern Anal Mach Intell. 2022 Feb;44(2):710-722. doi: 10.1109/TPAMI.2019.2909864. Epub 2022 Jan 7.

Cited By

1. Path Planning Generator with Metadata through a Domain Change by GAN between Physical and Virtual Environments.
   Sensors (Basel). 2021 Nov 18;21(22):7667. doi: 10.3390/s21227667.
2. Extracting Effective Image Attributes with Refined Universal Detection.
   Sensors (Basel). 2020 Dec 25;21(1):95. doi: 10.3390/s21010095.