
LCM-Captioner: A lightweight text-based image captioning method with collaborative mechanism between vision and text.

Author affiliations

State Key Laboratory of Public Big Data, College of Computer Science and Technology, Guizhou University, China.

Center for Research and Development of Fine Chemicals, Guizhou University, Guiyang, China; State Key Laboratory of Public Big Data, College of Computer Science and Technology, Guizhou University, China.

Publication information

Neural Netw. 2023 May;162:318-329. doi: 10.1016/j.neunet.2023.03.010. Epub 2023 Mar 11.

DOI: 10.1016/j.neunet.2023.03.010
PMID: 36934693
Abstract

Text-based image captioning (TextCap) aims to remedy the shortcomings of existing image captioning tasks that ignore text content when describing images. Instead, it requires models to recognize and describe images from both visual and textual content to achieve a deeper level of comprehension of the images. However, existing methods tend to use numerous complex network architectures to improve performance. On the one hand, these still fail to adequately model the relationship between vision and text; on the other hand, they lead to long running times, high memory consumption, and other deployment problems. To solve these issues, we have developed a lightweight captioning method with a collaborative mechanism, LCM-Captioner, which balances high efficiency with high performance. First, we propose a feature-lightening transformation for the TextCap task, named TextLighT, which learns rich multimodal representations while mapping features to lower dimensions, thereby reducing memory costs. Next, we present a collaborative attention module for visual and text information, VTCAM, which facilitates the semantic alignment of multimodal information to uncover important visual objects and textual content. Finally, extensive experiments on the TextCaps dataset demonstrate the effectiveness of our method. Code is available at https://github.com/DengHY258/LCM-Captioner.
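
The abstract names two ideas: mapping features to a lower dimension, and letting visual and text features attend to each other. The sketch below illustrates those two generic ideas only. It is not the authors' TextLighT or VTCAM, whose actual design is in the paper and the linked repository. Every name here (`project`, `cross_attend`, `w_down`), the toy sizes, the shared untrained projection, and the use of plain scaled dot-product attention are assumptions made for illustration.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Swap rows and columns."""
    return [list(col) for col in zip(*m)]


def softmax_rows(m):
    """Numerically stable softmax applied to each row."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def project(features, weight):
    """Map each feature vector to a lower dimension (hypothetical stand-in)."""
    return matmul(features, weight)


def cross_attend(queries, keys_values):
    """Scaled dot-product attention: each query row attends over keys_values."""
    scale = math.sqrt(len(queries[0]))
    scores = matmul(queries, transpose(keys_values))
    weights = softmax_rows([[s / scale for s in row] for row in scores])
    return matmul(weights, keys_values), weights


def random_matrix(rows, cols, rng):
    """Random values in [-1, 1]; stands in for real features or weights."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
visual = random_matrix(4, 8, rng)  # 4 visual object features, 8-dim (toy sizes)
text = random_matrix(3, 8, rng)    # 3 scene-text token features, 8-dim
w_down = random_matrix(8, 2, rng)  # untrained projection from 8 to 2 dims (assumed shared)

# Step 1: shrink both modalities so later operations touch fewer numbers.
visual_low = project(visual, w_down)
text_low = project(text, w_down)

# Step 2: attend in both directions so each modality is re-expressed
# in terms of the other.
visual_ctx, vision_to_text = cross_attend(visual_low, text_low)
text_ctx, text_to_vision = cross_attend(text_low, visual_low)

print(len(visual_ctx), len(visual_ctx[0]))  # rows and columns of the visual output
print(len(text_ctx), len(text_ctx[0]))      # rows and columns of the text output
```

In this toy setup the attention score matrices have 4 x 3 and 3 x 4 entries regardless of feature width, so the saving from the projection comes from the per-token vectors shrinking from 8 to 2 values. A trained model would learn the projection weights rather than sample them.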

Similar articles

1
LCM-Captioner: A lightweight text-based image captioning method with collaborative mechanism between vision and text.
Neural Netw. 2023 May;162:318-329. doi: 10.1016/j.neunet.2023.03.010. Epub 2023 Mar 11.
2
Thangka Image Captioning Based on Semantic Concept Prompt and Multimodal Feature Optimization.
J Imaging. 2023 Aug 16;9(8):162. doi: 10.3390/jimaging9080162.
3
Context-Fused Guidance for Image Captioning Using Sequence-Level Training.
Comput Intell Neurosci. 2022 Jan 5;2022:9743123. doi: 10.1155/2022/9743123. eCollection 2022.
4
Visual Cluster Grounding for Image Captioning.
IEEE Trans Image Process. 2022;31:3920-3934. doi: 10.1109/TIP.2022.3177318. Epub 2022 Jun 9.
5
Image Captioning with End-to-end Attribute Detection and Subsequent Attributes Prediction.
IEEE Trans Image Process. 2020 Jan 30. doi: 10.1109/TIP.2020.2969330.
6
Dual Global Enhanced Transformer for image captioning.
Neural Netw. 2022 Apr;148:129-141. doi: 10.1016/j.neunet.2022.01.011. Epub 2022 Jan 21.
7
Memorial GAN With Joint Semantic Optimization for Unpaired Image Captioning.
IEEE Trans Cybern. 2023 Jul;53(7):4388-4399. doi: 10.1109/TCYB.2022.3175012. Epub 2023 Jun 15.
8
Chinese Image Caption Generation via Visual Attention and Topic Modeling.
IEEE Trans Cybern. 2022 Feb;52(2):1247-1257. doi: 10.1109/TCYB.2020.2997034. Epub 2022 Feb 16.
9
Arabic Captioning for Images of Clothing Using Deep Learning.
Sensors (Basel). 2023 Apr 7;23(8):3783. doi: 10.3390/s23083783.
10
Self-Guiding Multimodal LSTM-When We Do Not Have a Perfect Training Dataset for Image Captioning.
IEEE Trans Image Process. 2019 Nov;28(11):5241-5252. doi: 10.1109/TIP.2019.2917229. Epub 2019 May 22.

Cited by

1
Auto-LIA: The Automated Vision-Based Leaf Inclination Angle Measurement System Improves Monitoring of Plant Physiology.
Plant Phenomics. 2024 Sep 11;6:0245. doi: 10.34133/plantphenomics.0245. eCollection 2024.
2
CSNet: A Count-Supervised Network via Multiscale MLP-Mixer for Wheat Ear Counting.
Plant Phenomics. 2024 Aug 20;6:0236. doi: 10.34133/plantphenomics.0236. eCollection 2024.