Lightweight dense video captioning with cross-modal attention and knowledge-enhanced unbiased scene graph.

Authors

Han Shixing, Liu Jin, Zhang Jinyingming, Gong Peizhu, Zhang Xiliang, He Huihua

Affiliations

College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China.

College of Early Childhood Education, Shanghai Normal University, Shanghai 200234, China.

Publication

Complex Intell Systems. 2023 Feb 24:1-18. doi: 10.1007/s40747-023-00998-5.

DOI: 10.1007/s40747-023-00998-5
PMID: 36855683
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9950023/
Abstract

Dense video captioning (DVC) aims at generating description for each scene in a video. Despite attractive progress for this task, previous works usually only concentrate on exploiting visual features while neglecting audio information in the video, resulting in inaccurate scene event location. In this article, we propose a novel DVC model named CMCR, which is mainly composed of a cross-modal processing (CM) module and a commonsense reasoning (CR) module. CM utilizes a cross-modal attention mechanism to encode data in different modalities. An event refactoring algorithm is proposed to deal with inaccurate event localization caused by overlapping events. Besides, a shared encoder is utilized to reduce model redundancy. CR optimizes the logic of generated captions with both heterogeneous prior knowledge and entities' association reasoning achieved by building a knowledge-enhanced unbiased scene graph. Extensive experiments are conducted on ActivityNet Captions dataset, the results demonstrate that our model achieves better performance than state-of-the-art methods. To better understand the performance achieved by CMCR, we also apply ablation experiments to analyze the contributions of different modules.
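
The core idea of the CM module, letting one modality attend to another so that visual and audio streams inform each other, can be illustrated with a short sketch. Below is a minimal, hypothetical PyTorch example of cross-modal attention; the class name, dimensions, and residual fusion are assumptions for exposition, not the authors' actual CMCR implementation.

import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    # Illustrative sketch only, not the paper's implementation. Queries come
    # from one modality and keys/values from the other, so each stream can
    # borrow context across modalities.
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query_feats, context_feats):
        # query_feats:   (batch, T_q, dim), e.g. visual frame features
        # context_feats: (batch, T_c, dim), e.g. audio segment features
        attended, _ = self.attn(query_feats, context_feats, context_feats)
        return self.norm(query_feats + attended)  # residual fusion

visual = torch.randn(2, 32, 512)   # 32 visual frame features (hypothetical shapes)
audio = torch.randn(2, 48, 512)    # 48 audio segment features
fused = CrossModalAttention()(visual, audio)
print(fused.shape)                 # torch.Size([2, 32, 512])

The paper additionally describes a shared encoder to reduce model redundancy across modality streams; the single attention layer above is only meant to show the attention pattern itself.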


Figures

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8556/9950023/0d98d966350c/40747_2023_998_Fig1_HTML.jpg
Figure 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8556/9950023/f1790765919c/40747_2023_998_Fig2_HTML.jpg
Figure 3: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8556/9950023/dcb40f1c9d52/40747_2023_998_Fig3_HTML.jpg
Figure 4: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8556/9950023/dd38ae68c61e/40747_2023_998_Fig4_HTML.jpg
Figure 5: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8556/9950023/676f6b28e1d6/40747_2023_998_Fig5_HTML.jpg
Figure 6: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8556/9950023/23876bef4fd8/40747_2023_998_Fig6_HTML.jpg
Figure 7: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8556/9950023/7cfb1f316d46/40747_2023_998_Fig7_HTML.jpg
Figure 8: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8556/9950023/844766dd75aa/40747_2023_998_Fig8_HTML.jpg
Figure 9: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8556/9950023/b5cb60fa506c/40747_2023_998_Fig9_HTML.jpg
Figure 10: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8556/9950023/6605a6dc75ab/40747_2023_998_Fig10_HTML.jpg
Figure 11: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8556/9950023/67d620bb53e9/40747_2023_998_Fig11_HTML.jpg
Figure 12: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8556/9950023/124438431247/40747_2023_998_Fig12_HTML.jpg
Figure 13: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8556/9950023/6e3e56bf6102/40747_2023_998_Fig13_HTML.jpg
Figure 14: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8556/9950023/7fa3a2b009ed/40747_2023_998_Fig14_HTML.jpg
Figure 15: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8556/9950023/80834b6d42b1/40747_2023_998_Fig15_HTML.jpg
Figure 16: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8556/9950023/050733d6b74b/40747_2023_998_Fig16_HTML.jpg
Figure 17: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8556/9950023/41badeabb853/40747_2023_998_Fig17_HTML.jpg
Figure 18: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8556/9950023/121a84ab548d/40747_2023_998_Fig18_HTML.jpg
Figure 19: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8556/9950023/4c4471dd2de9/40747_2023_998_Fig19_HTML.jpg

Similar Articles

1. Lightweight dense video captioning with cross-modal attention and knowledge-enhanced unbiased scene graph.
Complex Intell Systems. 2023 Feb 24:1-18. doi: 10.1007/s40747-023-00998-5.
2. Event-centric multi-modal fusion method for dense video captioning.
Neural Netw. 2022 Feb;146:120-129. doi: 10.1016/j.neunet.2021.11.017. Epub 2021 Nov 22.
3. Fusion of Multi-Modal Features to Enhance Dense Video Caption.
Sensors (Basel). 2023 Jun 14;23(12):5565. doi: 10.3390/s23125565.
4. Class-dependent and cross-modal memory network considering sentimental features for video-based captioning.
Front Psychol. 2023 Feb 15;14:1124369. doi: 10.3389/fpsyg.2023.1124369. eCollection 2023.
5. Cross-Modal Graph With Meta Concepts for Video Captioning.
IEEE Trans Image Process. 2022;31:5150-5162. doi: 10.1109/TIP.2022.3192709. Epub 2022 Aug 2.
6. Learning Hierarchical Modular Networks for Video Captioning.
IEEE Trans Pattern Anal Mach Intell. 2024 Feb;46(2):1049-1064. doi: 10.1109/TPAMI.2023.3327677. Epub 2024 Jan 9.
7. Adversarial Reinforcement Learning With Object-Scene Relational Graph for Video Captioning.
IEEE Trans Image Process. 2022;31:2004-2016. doi: 10.1109/TIP.2022.3148868. Epub 2022 Feb 25.
8. Visual Commonsense-Aware Representation Network for Video Captioning.
IEEE Trans Neural Netw Learn Syst. 2025 Jan;36(1):1092-1103. doi: 10.1109/TNNLS.2023.3323491. Epub 2025 Jan 7.
9. Dense Relational Image Captioning via Multi-Task Triple-Stream Networks.
IEEE Trans Pattern Anal Mach Intell. 2022 Nov;44(11):7348-7362. doi: 10.1109/TPAMI.2021.3119754. Epub 2022 Oct 4.
10. Auto-Encoding and Distilling Scene Graphs for Image Captioning.
IEEE Trans Pattern Anal Mach Intell. 2022 May;44(5):2313-2327. doi: 10.1109/TPAMI.2020.3042192. Epub 2022 Apr 1.

Cited By

1. VT-3DCapsNet: Visual tempos 3D-Capsule network for video-based facial expression recognition.
PLoS One. 2024 Aug 23;19(8):e0307446. doi: 10.1371/journal.pone.0307446. eCollection 2024.

References

1. Mask R-CNN.
IEEE Trans Pattern Anal Mach Intell. 2020 Feb;42(2):386-397. doi: 10.1109/TPAMI.2018.2844175. Epub 2018 Jun 5.