Context-Aware Emotion Recognition in the Wild Using Spatio-Temporal and Temporal-Pyramid Models

Affiliations

Department of Artificial Intelligence Convergence, Chonnam National University, 77 Yongbong-ro, Gwangju 500-757, Korea.

School of Technology, Environment and Design, University of Tasmania, Hobart, TAS 7001, Australia.

Publication information

Sensors (Basel). 2021 Mar 27;21(7):2344. doi: 10.3390/s21072344.

DOI:10.3390/s21072344

PMID:33801739

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC8036494/

Abstract

Emotion recognition plays an important role in human-computer interaction. Recent studies have focused on video emotion recognition in the wild and have encountered difficulties related to occlusion, illumination, complex behavior over time, and auditory cues. State-of-the-art methods use multiple modalities, such as frame-level, spatiotemporal, and audio approaches. However, such methods have difficulty exploiting long-term dependencies in temporal information, capturing contextual information, and integrating multi-modal information. In this paper, we introduce a flexible multi-modal system for video-based emotion recognition in the wild. Our system tracks and votes on significant faces corresponding to persons of interest in a video to classify seven basic emotions. The key contribution of this study is the use of face feature extraction with context-aware and statistical information for emotion recognition. We also build two model architectures that exploit long-term dependencies in temporal information: a temporal-pyramid model and a spatiotemporal model with a "Conv2D+LSTM+3DCNN+Classify" architecture. Finally, we propose a best selection ensemble to improve the accuracy of multi-modal fusion. It selects the best combination of spatiotemporal and temporal-pyramid models to maximize accuracy in classifying the seven basic emotions. In our experiments, we benchmark the system on the AFEW dataset and obtain high accuracy.
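
The abstract names two ideas that a short sketch may help make concrete: pooling frame features over a temporal pyramid, and choosing the best combination of models for fusion. The Python below is an illustration under stated assumptions, not the authors' implementation. The abstract does not specify the pyramid levels, the pooling operator, or the fusion rule, so mean pooling over 1, 2, and 4 segments, probability averaging, and an exhaustive subset search are all assumptions. The function names `temporal_pyramid_pool` and `best_selection_ensemble` are hypothetical.

```python
from itertools import combinations
from statistics import fmean


def temporal_pyramid_pool(frames, levels=(1, 2, 4)):
    """Mean-pool per-frame feature vectors over a pyramid of temporal segments.

    Assumption: at each level the clip is cut into n roughly equal segments,
    each segment is averaged, and all averages are concatenated.
    """
    if not frames:
        raise ValueError("need at least one frame")
    total = len(frames)
    pooled = []
    for n in levels:
        for i in range(n):
            start = i * total // n
            # Keep at least one frame per segment, even for very short clips.
            end = max((i + 1) * total // n, start + 1)
            segment = frames[start:end]
            pooled.extend(fmean(column) for column in zip(*segment))
    return pooled


def accuracy(probs, labels):
    """Fraction of samples whose highest-probability class equals the label."""
    hits = sum(
        max(range(len(p)), key=p.__getitem__) == y
        for p, y in zip(probs, labels)
    )
    return hits / len(labels)


def best_selection_ensemble(model_probs, labels):
    """Search every non-empty subset of models and keep the most accurate one.

    Assumption: a subset is fused by averaging class probabilities, and
    subsets are scored on a held-out validation set.
    """
    names = sorted(model_probs)
    best_names, best_acc = None, -1.0
    for k in range(1, len(names) + 1):
        for subset in combinations(names, k):
            fused = [
                [fmean(values) for values in zip(*sample)]
                for sample in zip(*(model_probs[m] for m in subset))
            ]
            acc = accuracy(fused, labels)
            if acc > best_acc:
                best_names, best_acc = subset, acc
    return best_names, best_acc


if __name__ == "__main__":
    # Toy validation set: three samples, three classes (the paper uses seven).
    labels = [0, 1, 2]
    model_probs = {
        "spatiotemporal": [[0.9, 0.05, 0.05], [0.1, 0.8, 0.1], [0.1, 0.5, 0.4]],
        "temporal_pyramid": [[0.8, 0.1, 0.1], [0.5, 0.4, 0.1], [0.1, 0.2, 0.7]],
    }
    print(best_selection_ensemble(model_probs, labels))
```

In this toy data each model misclassifies a different sample, so the averaged pair scores higher than either model alone. An exhaustive search grows exponentially with the number of candidate models, so it is only practical for a small pool.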
