Umirzakova Sabina, Muksimova Shakhnoza, Mardieva Sevara, Sultanov Baxtiyarovich Murodjon, Cho Young-Im
Department of Computer Engineering, Gachon University, Sujeong-gu, Seongnam-si 13120, Republic of Korea.
Department of Information Systems and Technologies, Tashkent State University of Economics, Tashkent 100066, Uzbekistan.
Sensors (Basel). 2024 Dec 15;24(24):8013. doi: 10.3390/s24248013.
Generating accurate and contextually rich captions for images and videos is essential for applications ranging from assistive technology to content recommendation. However, challenges such as maintaining temporal coherence in videos, reducing noise in large-scale datasets, and enabling real-time captioning remain significant. We introduce MIRA-CAP (Memory-Integrated Retrieval-Augmented Captioning), a novel framework designed to address these issues through three core innovations: a cross-modal memory bank, adaptive dataset pruning, and a streaming decoder. The cross-modal memory bank retrieves relevant context from prior frames, enhancing temporal consistency and narrative flow. The adaptive pruning mechanism filters noisy data, improving alignment and generalization. The streaming decoder enables real-time captioning by generating captions incrementally, without requiring access to the full video sequence. Evaluated on standard datasets such as MS COCO, YouCook2, ActivityNet, and Flickr30k, MIRA-CAP achieves state-of-the-art results, with high scores on the CIDEr, SPICE, and Polos metrics, underscoring its alignment with human judgment and its effectiveness in handling complex visual and temporal structures. This work demonstrates that MIRA-CAP offers a robust, scalable solution for both static and dynamic captioning tasks, advancing the capabilities of vision-language models in real-world applications.
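To make the cross-modal memory bank idea concrete, the sketch below shows one common way such retrieval is implemented: stored frame embeddings are ranked by cosine similarity against the current query embedding, and the top entries are returned as context. This is an illustrative toy, not the paper's actual architecture; the function name `retrieve_context`, the embedding dimensionality, and the use of plain cosine similarity are all assumptions for demonstration.

```python
import numpy as np

def retrieve_context(query, memory_bank, top_k=2):
    """Return indices and scores of the top_k memory entries most
    similar to the query embedding (cosine similarity over
    L2-normalized vectors). Illustrative stand-in for cross-modal
    memory retrieval; not the paper's implementation."""
    q = query / np.linalg.norm(query)
    m = memory_bank / np.linalg.norm(memory_bank, axis=1, keepdims=True)
    scores = m @ q                      # cosine similarity per entry
    idx = np.argsort(scores)[::-1][:top_k]
    return idx, scores[idx]

# Toy memory bank: four stored frame embeddings in 3-D.
bank = np.array([[1.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0],
                 [0.9, 0.1, 0.0],
                 [0.0, 0.0, 1.0]])
query = np.array([1.0, 0.05, 0.0])      # embedding of the current frame
idx, scores = retrieve_context(query, bank)
print(idx)  # → [0 2]: the two entries closest in direction to the query
```

In a real captioning pipeline, the retrieved entries would be fed to the decoder as additional context, which is what lets captions for later frames stay consistent with earlier ones.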