Pang Nuo, Guo Songlin, Yan Ming, Chan Chien Aun
School of Design, Dalian University of Science and Technology, Dalian 116052, China.
School of Information and Communications Engineering, Communication University of China, Beijing 100024, China.
Sensors (Basel). 2023 Oct 12;23(20):8425. doi: 10.3390/s23208425.
The explosive growth of online short videos has brought great challenges to the efficient management of video content classification, retrieval, and recommendation. Video features for video management can be extracted from video image frames by various algorithms, and they have been proven effective for video classification in sensor systems. However, frame-by-frame processing of video image frames requires enormous computing power, and classification algorithms based on a single modality of video features cannot meet the accuracy requirements of specific scenarios. In response to these concerns, we introduce a short video classification architecture centered on cross-modal fusion in visual sensor systems, which jointly utilizes video features and text features to classify short videos and avoids processing a large number of image frames during classification. First, the image space is extended to three-dimensional space-time by a self-attention mechanism, and a series of patches is extracted from a single image frame. Each patch is linearly mapped into the embedding layer of the Timesformer network and augmented with positional information to extract video features. Second, the text features of subtitles are extracted with the Bidirectional Encoder Representations from Transformers (BERT) pre-trained model. Finally, cross-modal fusion is performed on the extracted video and text features, improving the accuracy of short video classification. Our experimental results show that the proposed classification framework substantially outperforms baseline video classification methods. The framework can be applied to video classification in sensor systems.
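The pipeline described in the abstract (patch extraction from a frame, linear embedding with positional information, and late fusion of video and text features) can be sketched as follows. This is a minimal illustrative sketch in NumPy, not the paper's implementation: the patch size, embedding dimension, number of classes, and the random text vector standing in for a BERT embedding are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def extract_patches(frame, patch=16):
    """Split an HxWxC frame into non-overlapping, flattened patches."""
    H, W, _ = frame.shape
    return np.stack([frame[i:i + patch, j:j + patch].reshape(-1)
                     for i in range(0, H, patch)
                     for j in range(0, W, patch)])   # (num_patches, patch*patch*C)

def embed_patches(patches, W_embed, pos):
    """Linearly map each patch to the embedding dimension, add positional info."""
    return patches @ W_embed + pos                   # (num_patches, d)

def fuse_and_classify(video_emb, text_emb, W_cls):
    """Late cross-modal fusion: pool video patch embeddings, concatenate
    with the text embedding, and apply a linear classifier."""
    fused = np.concatenate([video_emb.mean(axis=0), text_emb])
    return int(np.argmax(fused @ W_cls))

d = 32                                               # assumed embedding dim
frame = rng.random((64, 64, 3))                      # one synthetic image frame
patches = extract_patches(frame)                     # 16 patches of dim 768
W_embed = rng.standard_normal((patches.shape[1], d)) * 0.02
pos = rng.standard_normal((patches.shape[0], d)) * 0.02
video_emb = embed_patches(patches, W_embed, pos)
text_emb = rng.standard_normal(d)                    # stand-in for a BERT subtitle feature
label = fuse_and_classify(video_emb, text_emb, rng.standard_normal((2 * d, 5)))
```

In the actual architecture, the patch embeddings are processed by Timesformer's divided space-time self-attention across frames, and `text_emb` comes from BERT; the sketch only shows the data flow from patches to a fused classification decision.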