

Visual Scene-Aware Hybrid and Multi-Modal Feature Aggregation for Facial Expression Recognition.

Affiliation

Department of Electronic Engineering, Inha University, 100 Inha-ro, Michuhol-gu, Incheon 22212, Korea.

Publication

Sensors (Basel). 2020 Sep 11;20(18):5184. doi: 10.3390/s20185184.

Abstract

Facial expression recognition (FER) technology has made considerable progress with the rapid development of deep learning. However, conventional FER techniques are mainly designed and trained on videos acquired in controlled environments, so they may not operate robustly on videos captured in the wild, which suffer from varying illumination and head poses. To solve this problem and improve the ultimate performance of FER, this paper proposes a new architecture that extends a state-of-the-art FER scheme with a multi-modal neural network that effectively fuses image and landmark information. To this end, we propose three methods. First, to maximize the performance of the recurrent neural network (RNN) in the previous scheme, we propose a frame substitution module that replaces the latent features of less important frames with those of important frames, based on inter-frame correlation. Second, we propose a method for extracting facial landmark features, likewise based on inter-frame correlation. Third, we propose a new multi-modal fusion method that fuses video and facial landmark information at the feature level: novel fusion is achieved by applying attention, derived from the characteristics of each modality, to that modality's features. Experimental results show that the proposed method provides remarkable performance, achieving 51.4% accuracy on the wild AFEW dataset, 98.5% on the CK+ dataset, and 81.9% on the MMI dataset, outperforming state-of-the-art networks.
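The frame substitution idea in the abstract can be sketched in a few lines: given per-frame latent features and per-frame importance scores, replace each low-importance frame's features with those of the most correlated high-importance frame. Note this is a minimal illustration, not the paper's implementation; the importance scores, the 0.5 threshold, and the use of cosine similarity as the inter-frame correlation measure are all assumptions.

```python
import numpy as np

def frame_substitution(latent, importance, threshold=0.5):
    """Replace latent features of less important frames with those of
    the most correlated important frame (illustrative sketch).

    latent:     (T, D) array of per-frame latent features
    importance: (T,)   array of per-frame importance scores in [0, 1]
    threshold:  frames scoring at or above this are kept as-is (assumed)
    """
    # Normalize rows so the dot product gives cosine similarity.
    norm = latent / (np.linalg.norm(latent, axis=1, keepdims=True) + 1e-8)
    corr = norm @ norm.T                     # (T, T) inter-frame correlation
    important = importance >= threshold
    out = latent.copy()
    if not important.any():                  # nothing to substitute from
        return out
    cand = np.where(important)[0]
    for t in np.where(~important)[0]:
        # Pick the important frame most correlated with frame t.
        best = cand[np.argmax(corr[t, cand])]
        out[t] = latent[best]
    return out
```

For example, with three frames where the middle one is unimportant, its features are replaced by those of whichever important frame it correlates with most strongly.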


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/95fd/7571042/c1a0918c8704/sensors-20-05184-g001.jpg
