


A Short Video Classification Framework Based on Cross-Modal Fusion.

Authors

Pang Nuo, Guo Songlin, Yan Ming, Chan Chien Aun

Affiliations

School of Design, Dalian University of Science and Technology, Dalian 116052, China.

School of Information and Communications Engineering, Communication University of China, Beijing 100024, China.

Publication

Sensors (Basel). 2023 Oct 12;23(20):8425. doi: 10.3390/s23208425.

DOI: 10.3390/s23208425
PMID: 37896519
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10611385/
Abstract

The explosive growth of online short videos has brought great challenges to the efficient management of video content classification, retrieval, and recommendation. Video features for video management can be extracted from video image frames by various algorithms, and they have been proven to be effective in the video classification of sensor systems. However, frame-by-frame processing of video image frames not only requires huge computing power, but also classification algorithms based on a single modality of video features cannot meet the accuracy requirements in specific scenarios. In response to these concerns, we introduce a short video categorization architecture centered around cross-modal fusion in visual sensor systems which jointly utilizes video features and text features to classify short videos, avoiding processing a large number of image frames during classification. Firstly, the image space is extended to three-dimensional space-time by a self-attention mechanism, and a series of patches are extracted from a single image frame. Each patch is linearly mapped into the embedding layer of the Timesformer network and augmented with positional information to extract video features. Second, the text features of subtitles are extracted through the bidirectional encoder representation from the Transformers (BERT) pre-training model. Finally, cross-modal fusion is performed based on the extracted video and text features, resulting in improved accuracy for short video classification tasks. The outcomes of our experiments showcase a substantial superiority of our introduced classification framework compared to alternative baseline video classification methodologies. This framework can be applied in sensor systems for potential video classification.
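The patch-embedding step the abstract describes (splitting a frame into patches, linearly mapping each into the Timesformer embedding space, and adding positional information) can be sketched in NumPy. The frame size, patch size, and embedding dimension below are illustrative assumptions, and the learned projection and positional embeddings are replaced by random matrices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a 224x224 RGB frame split into 16x16 patches,
# embedded into 128 dimensions (the paper's exact sizes may differ).
H = W = 224
P = 16            # patch side length
C = 3             # channels
D = 128           # embedding dimension
n_patches = (H // P) * (W // P)   # 196 patches per frame

frame = rng.random((H, W, C))

# Split the frame into non-overlapping P x P patches and flatten each one.
patches = (frame
           .reshape(H // P, P, W // P, P, C)
           .transpose(0, 2, 1, 3, 4)
           .reshape(n_patches, P * P * C))

# Linear map into the embedding space (stand-in for the learned projection).
W_embed = rng.standard_normal((P * P * C, D)) * 0.02
tokens = patches @ W_embed

# Add positional information (random here, learned in the real network)
# so the attention layers know where each patch sits in space-time.
pos_embed = rng.standard_normal((n_patches, D)) * 0.02
tokens = tokens + pos_embed

print(tokens.shape)  # (196, 128)
```

In the actual framework these tokens would be fed through the Timesformer's space-time self-attention layers rather than used directly.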

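The final fusion step, combining the Timesformer video feature with the BERT text feature before a classification head, can likewise be illustrated with stand-in vectors. The dimensions, the random classifier weights, and concatenation as the fusion operator are assumptions for illustration, not the paper's trained components:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical dimensions; the Timesformer and BERT extractors are
# replaced by random vectors purely to illustrate the fusion step.
d_video, d_text, n_classes = 128, 768, 10

video_feat = rng.standard_normal(d_video)   # stand-in for the video feature
text_feat = rng.standard_normal(d_text)     # stand-in for the BERT text feature

# Cross-modal fusion by concatenation, then a linear classification head.
fused = np.concatenate([video_feat, text_feat])          # shape (896,)
W_cls = rng.standard_normal((d_video + d_text, n_classes)) * 0.02
logits = fused @ W_cls

# Softmax over class logits gives the category distribution.
probs = np.exp(logits - logits.max())
probs /= probs.sum()

print(probs.shape)  # (10,)
```

Because classification operates on one pooled video feature plus the subtitle feature, this design avoids running the classifier over every image frame, which is the efficiency argument the abstract makes.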

Figures
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/24e1/10611385/dd077eaee353/sensors-23-08425-g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/24e1/10611385/24e2cbfb56c7/sensors-23-08425-g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/24e1/10611385/ce87cb7979d9/sensors-23-08425-g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/24e1/10611385/2ed80c9d9b26/sensors-23-08425-g004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/24e1/10611385/f6a7e8ab882f/sensors-23-08425-g005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/24e1/10611385/8a08a0afc999/sensors-23-08425-g006.jpg


