State Key Laboratory of Public Big Data, College of Computer Science and Technology, Guizhou University, China.
Center for Research and Development of Fine Chemicals, Guizhou University, Guiyang, China; State Key Laboratory of Public Big Data, College of Computer Science and Technology, Guizhou University, China.
Neural Netw. 2023 May;162:318-329. doi: 10.1016/j.neunet.2023.03.010. Epub 2023 Mar 11.
Text-based image captioning (TextCap) aims to remedy a shortcoming of existing image captioning tasks, which ignore text content when describing images. Instead, it requires models to recognize and describe images from both their visual and textual content, achieving a deeper level of image comprehension. However, existing methods tend to rely on numerous complex network architectures to improve performance; on the one hand, these still fail to adequately model the relationship between vision and text, and on the other, they lead to long running times, high memory consumption, and other unfavorable deployment problems. To address these issues, we develop a lightweight captioning method with a collaborative mechanism, LCM-Captioner, which balances high efficiency with high performance. First, we propose a feature-lightening transformation for the TextCap task, named TextLighT, which learns rich multimodal representations while mapping features to lower dimensions, thereby reducing memory costs. Next, we present VTCAM, a collaborative attention module for visual and textual information, which facilitates the semantic alignment of multimodal information to uncover important visual objects and textual content. Finally, extensive experiments conducted on the TextCaps dataset demonstrate the effectiveness of our method. Code is available at https://github.com/DengHY258/LCM-Captioner.
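Since the abstract only names the two components, the following is a minimal, hypothetical PyTorch sketch of the ideas it describes: a TextLighT-style projection that maps high-dimensional visual and OCR-token features into a shared lower-dimensional space, and a VTCAM-style collaborative (cross-) attention step in which each modality attends to the other. All module names, dimensions, and wiring here are illustrative assumptions, not the authors' implementation; the actual code is in the linked repository.

    # Illustrative sketch only; see https://github.com/DengHY258/LCM-Captioner
    # for the authors' actual implementation.
    import torch
    import torch.nn as nn

    class LightProjection(nn.Module):
        """Hypothetical feature-lightening step: project features to a low dim."""
        def __init__(self, in_dim: int, low_dim: int):
            super().__init__()
            self.proj = nn.Sequential(
                nn.Linear(in_dim, low_dim),
                nn.LayerNorm(low_dim),
                nn.GELU(),
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.proj(x)

    class CollaborativeAttention(nn.Module):
        """Hypothetical co-attention: each modality queries the other."""
        def __init__(self, dim: int, heads: int = 4):
            super().__init__()
            self.vis_to_txt = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.txt_to_vis = nn.MultiheadAttention(dim, heads, batch_first=True)

        def forward(self, vis: torch.Tensor, txt: torch.Tensor):
            # Visual tokens attend to text tokens, and vice versa,
            # with residual connections to preserve the original features.
            vis_out, _ = self.vis_to_txt(vis, txt, txt)
            txt_out, _ = self.txt_to_vis(txt, vis, vis)
            return vis_out + vis, txt_out + txt

    # Toy usage: 2048-d region features and 768-d OCR-token features
    # are mapped into a shared 256-d space before alignment.
    vis = LightProjection(2048, 256)(torch.randn(1, 36, 2048))
    txt = LightProjection(768, 256)(torch.randn(1, 20, 768))
    vis_aligned, txt_aligned = CollaborativeAttention(256)(vis, txt)
    print(vis_aligned.shape, txt_aligned.shape)  # (1, 36, 256), (1, 20, 256)

The memory saving in this sketch comes purely from running attention over 256-d rather than 2048-d features; how the paper's TextLighT preserves rich multimodal information under that reduction is detailed in the paper itself.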