Class-dependent and cross-modal memory network considering sentimental features for video-based captioning

Author Information

Xiong Haitao, Zhou Yuchen, Liu Jiaming, Cai Yuanyuan

Affiliations

School of International Economics and Management, Beijing Technology and Business University, Beijing, China.

National Engineering Research Centre for Agri-Product Quality Traceability, Beijing Technology and Business University, Beijing, China.

Publication Information

Front Psychol. 2023 Feb 15;14:1124369. doi: 10.3389/fpsyg.2023.1124369. eCollection 2023.

DOI: 10.3389/fpsyg.2023.1124369
PMID: 36874867
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9975600/
Abstract

The video-based commonsense captioning task aims to add multiple commonsense descriptions to video captions so that video content can be understood better. This paper considers the importance of cross-modal mapping. We propose a combined framework, the Class-dependent and Cross-modal Memory Network considering SENtimental features (CCMN-SEN), for video-based captioning to enhance commonsense caption generation. First, we develop a class-dependent memory that records the alignment between video features and text; it allows cross-modal interaction and generation only on cross-modal matrices that share the same labels. Then, to understand the sentiments conveyed in the videos and generate accurate captions, we add sentiment features to facilitate commonsense caption generation. Experimental results demonstrate that the proposed CCMN-SEN significantly outperforms state-of-the-art methods. These results have practical significance for better understanding video content.
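Since the page provides only the abstract, the following is a minimal PyTorch sketch of the two mechanisms it names: a class-dependent memory that restricts cross-modal interaction to video/text feature pairs sharing the same label, and a sentiment feature fused in before caption decoding. All module names, shapes, and fusion choices here are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the two ideas named in the abstract. Every name,
# dimension, and fusion choice here is an illustrative assumption;
# this is NOT the authors' CCMN-SEN implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClassDependentCrossModalMemory(nn.Module):
    """Cross-modal attention restricted to same-label pairs, plus a
    learnable per-class memory bank."""
    def __init__(self, num_classes: int, slots_per_class: int, dim: int):
        super().__init__()
        # One bank of learnable memory slots per class label.
        self.memory = nn.Parameter(torch.randn(num_classes, slots_per_class, dim))
        self.query_proj = nn.Linear(dim, dim)

    def forward(self, video_feats, video_labels, text_feats, text_labels):
        # video_feats: (Nv, dim); text_feats: (Nt, dim)
        # video_labels: (Nv,) long; text_labels: (Nt,) long
        q = self.query_proj(video_feats)                          # (Nv, dim)
        scores = q @ text_feats.t() / text_feats.size(-1) ** 0.5  # (Nv, Nt)
        # Class-dependent constraint: mask out entries of the cross-modal
        # score matrix whose video and text labels differ, so interaction
        # happens only where the two modalities share a label.
        same_label = video_labels.unsqueeze(1) == text_labels.unsqueeze(0)
        scores = scores.masked_fill(~same_label, float("-inf"))
        # NaN-safe softmax for rows that have no same-label partner.
        attn = torch.nan_to_num(F.softmax(scores, dim=-1))
        aligned = attn @ text_feats                               # (Nv, dim)
        # Read each feature's class memory and add it to the aligned context.
        mem = self.memory[video_labels].mean(dim=1)               # (Nv, dim)
        return video_feats + aligned + mem

class SentimentAwareDecoder(nn.Module):
    """Concatenates a sentiment feature with the memory output, then decodes."""
    def __init__(self, dim: int, vocab_size: int):
        super().__init__()
        self.fuse = nn.Linear(2 * dim, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, fused_feats, sentiment_feats):
        # fused_feats, sentiment_feats: (Nv, dim)
        x = self.fuse(torch.cat([fused_feats, sentiment_feats], dim=-1))
        h, _ = self.rnn(x.unsqueeze(0))  # treat the Nv features as one sequence
        return self.out(h)               # (1, Nv, vocab_size) caption logits
```

The label mask is the point of the sketch: zeroing attention across mismatched classes is one straightforward way to realize "interaction only on cross-modal matrices that share the same labels"; the paper's actual memory read/write scheme may differ.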


Figures
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/692c/9975600/fad3ffbba43d/fpsyg-14-1124369-g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/692c/9975600/079ac5ebe678/fpsyg-14-1124369-g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/692c/9975600/ecd7625b32eb/fpsyg-14-1124369-g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/692c/9975600/8161bebd737d/fpsyg-14-1124369-g004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/692c/9975600/8acf84f908a3/fpsyg-14-1124369-g005.jpg

Similar Articles

1. Class-dependent and cross-modal memory network considering sentimental features for video-based captioning.
   Front Psychol. 2023 Feb 15;14:1124369. doi: 10.3389/fpsyg.2023.1124369. eCollection 2023.
2. Visual Commonsense-Aware Representation Network for Video Captioning.
   IEEE Trans Neural Netw Learn Syst. 2025 Jan;36(1):1092-1103. doi: 10.1109/TNNLS.2023.3323491. Epub 2025 Jan 7.
3. Fusion of Multi-Modal Features to Enhance Dense Video Caption.
   Sensors (Basel). 2023 Jun 14;23(12):5565. doi: 10.3390/s23125565.
4. Lightweight dense video captioning with cross-modal attention and knowledge-enhanced unbiased scene graph.
   Complex Intell Systems. 2023 Feb 24:1-18. doi: 10.1007/s40747-023-00998-5.
5. Cross-Modal Graph With Meta Concepts for Video Captioning.
   IEEE Trans Image Process. 2022;31:5150-5162. doi: 10.1109/TIP.2022.3192709. Epub 2022 Aug 2.
6. Gaze-assisted automatic captioning of fetal ultrasound videos using three-way multi-modal deep neural networks.
   Med Image Anal. 2022 Nov;82:102630. doi: 10.1016/j.media.2022.102630. Epub 2022 Sep 17.
7. Event-centric multi-modal fusion method for dense video captioning.
   Neural Netw. 2022 Feb;146:120-129. doi: 10.1016/j.neunet.2021.11.017. Epub 2021 Nov 22.
8. Center-enhanced video captioning model with multimodal semantic alignment.
   Neural Netw. 2024 Dec;180:106744. doi: 10.1016/j.neunet.2024.106744. Epub 2024 Sep 18.
9. Topic-Oriented Image Captioning Based on Order-Embedding.
   IEEE Trans Image Process. 2019 Jun;28(6):2743-2754. doi: 10.1109/TIP.2018.2889922. Epub 2018 Dec 27.
10. Aligning Source Visual and Target Language Domains for Unpaired Video Captioning.
    IEEE Trans Pattern Anal Mach Intell. 2022 Dec;44(12):9255-9268. doi: 10.1109/TPAMI.2021.3132229. Epub 2022 Nov 7.
