State Key Laboratory of Public Big Data, College of Computer Science and Technology, Guizhou University, China.
Center for Research and Development of Fine Chemicals, Guizhou University, Guiyang, China; State Key Laboratory of Public Big Data, College of Computer Science and Technology, Guizhou University, China.
Neural Netw. 2023 May;162:318-329. doi: 10.1016/j.neunet.2023.03.010. Epub 2023 Mar 11.
Text-based image captioning (TextCap) aims to remedy a shortcoming of existing image captioning tasks, which ignore text content when describing images. Instead, it requires models to recognize and describe images from both their visual and textual content, achieving a deeper level of image comprehension. However, existing methods tend to rely on numerous complex network architectures to improve performance; on the one hand, these still fail to adequately model the relationship between vision and text, and on the other, they lead to long running times, high memory consumption, and other unfavorable deployment problems. To address these issues, we develop a lightweight captioning method with a collaborative mechanism, LCM-Captioner, which balances high efficiency with high performance. First, we propose a feature-lightening transformation for the TextCap task, named TextLighT, which learns rich multimodal representations while mapping features to lower dimensions, thereby reducing memory costs. Next, we present VTCAM, a collaborative attention module for visual and textual information, which facilitates the semantic alignment of multimodal information to uncover important visual objects and textual content. Finally, extensive experiments conducted on the TextCaps dataset demonstrate the effectiveness of our method. Code is available at https://github.com/DengHY258/LCM-Captioner.
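Since the abstract only names the two components, the following is a minimal, hypothetical PyTorch sketch of the ideas it describes: a TextLighT-style projection that maps high-dimensional visual and OCR-token features into a shared lower-dimensional space, and a VTCAM-style collaborative (cross-) attention step in which each modality attends to the other. All module names, dimensions, and wiring here are illustrative assumptions, not the authors' implementation; the actual code is in the linked repository.

    # Illustrative sketch only; see https://github.com/DengHY258/LCM-Captioner
    # for the authors' actual implementation.
    import torch
    import torch.nn as nn

    class LightProjection(nn.Module):
        """Hypothetical feature-lightening step: project features to a low dim."""
        def __init__(self, in_dim: int, low_dim: int):
            super().__init__()
            self.proj = nn.Sequential(
                nn.Linear(in_dim, low_dim),
                nn.LayerNorm(low_dim),
                nn.GELU(),
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.proj(x)

    class CollaborativeAttention(nn.Module):
        """Hypothetical co-attention: each modality queries the other."""
        def __init__(self, dim: int, heads: int = 4):
            super().__init__()
            self.vis_to_txt = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.txt_to_vis = nn.MultiheadAttention(dim, heads, batch_first=True)

        def forward(self, vis: torch.Tensor, txt: torch.Tensor):
            # Visual tokens attend to text tokens, and vice versa,
            # with residual connections to preserve the original features.
            vis_out, _ = self.vis_to_txt(vis, txt, txt)
            txt_out, _ = self.txt_to_vis(txt, vis, vis)
            return vis_out + vis, txt_out + txt

    # Toy usage: 2048-d region features and 768-d OCR-token features
    # are mapped into a shared 256-d space before alignment.
    vis = LightProjection(2048, 256)(torch.randn(1, 36, 2048))
    txt = LightProjection(768, 256)(torch.randn(1, 20, 768))
    vis_aligned, txt_aligned = CollaborativeAttention(256)(vis, txt)
    print(vis_aligned.shape, txt_aligned.shape)  # (1, 36, 256), (1, 20, 256)

The memory saving in this sketch comes purely from running attention over 256-d rather than 2048-d features; how the paper's TextLighT preserves rich multimodal information under that reduction is detailed in the paper itself.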