
SMaTE: A Segment-Level Feature Mixing and Temporal Encoding Framework for Facial Expression Recognition.

Affiliations

Communication and Media Engineering, University of Science and Technology, 217, Gajeong-ro, Yuseong-gu, Daejeon 34113, Korea.

Electronics and Telecommunications Research Institute, 218, Gajeong-ro, Yuseong-gu, Daejeon 34129, Korea.

Publication Information

Sensors (Basel). 2022 Aug 1;22(15):5753. doi: 10.3390/s22155753.

Abstract

Despite advances in machine learning, implementing emotion recognition systems for real-world video content remains challenging. Videos may contain several types of data, such as images, audio, and text; however, multimodal models that require two or more of these types are difficult to apply to real-world video media lacking sound or subtitles (CCTV footage, illegally filmed content, etc.). Although facial expressions in image sequences can be used for emotion recognition, the diverse identities of individuals in real-world content limit computational modeling of the relationships between facial expressions. This study proposes a model that employs a video vision transformer to focus on the facial expression sequences in videos, effectively understanding and extracting facial expression information from individuals of diverse identities rather than fusing multimodal models. The design captures higher-quality facial expression information through mixed-token embedding, which combines facial expression sequences augmented in various ways into a single data representation, and comprises two modules: a spatial encoder and a temporal encoder. Furthermore, a temporal position embedding that focuses on the relationships between video frames is proposed and applied to the temporal encoder module. The performance of the proposed algorithm was compared with that of conventional methods on two emotion recognition datasets of video content, and the results demonstrate its superiority.
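To make the factorized design concrete, the sketch below illustrates in PyTorch how a spatial encoder can summarize each frame into a token, a learnable temporal position embedding can encode frame order, and a temporal encoder can model the relationships between frames. This is a minimal illustration under those assumptions, not the authors' implementation: all module names, dimensions, the average-pooled spatial summary, and the seven emotion classes are illustrative placeholders.

```python
# Minimal sketch of a spatial-then-temporal encoder with a learnable
# temporal position embedding, in the spirit of the abstract.
# NOT the authors' code; every size and design choice here is assumed.
import torch
import torch.nn as nn

class SegmentLevelFER(nn.Module):
    def __init__(self, num_frames=16, embed_dim=256, num_classes=7):
        super().__init__()
        # Spatial encoder: summarizes each frame (face crop) into one token.
        # A real system would use patch embedding plus a ViT per frame.
        self.spatial_encoder = nn.Sequential(
            nn.Conv2d(3, embed_dim, kernel_size=16, stride=16),  # patchify
            nn.AdaptiveAvgPool2d(1),                             # frame token
            nn.Flatten(1),
        )
        # Learnable temporal position embedding: injects frame order,
        # i.e., the frame-to-frame relationships the abstract highlights.
        self.temporal_pos = nn.Parameter(torch.zeros(1, num_frames, embed_dim))
        # Temporal encoder: transformer over the sequence of frame tokens.
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=4, batch_first=True)
        self.temporal_encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, clip):                      # clip: (B, T, 3, H, W)
        b, t = clip.shape[:2]
        frames = clip.flatten(0, 1)               # (B*T, 3, H, W)
        tokens = self.spatial_encoder(frames)     # (B*T, D)
        tokens = tokens.view(b, t, -1)            # (B, T, D)
        tokens = tokens + self.temporal_pos       # add temporal positions
        encoded = self.temporal_encoder(tokens)   # (B, T, D)
        return self.head(encoded.mean(dim=1))     # clip-level emotion logits

logits = SegmentLevelFER()(torch.randn(2, 16, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 7])
```

Factorizing the computation this way (per-frame spatial attention, then cross-frame temporal attention) is the key efficiency idea behind video-vision-transformer designs: it avoids full joint space-time attention while still letting the temporal position embedding expose frame order to the temporal module.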
