Department of Robotics and Mechatronics, Romanian Academy Institute of Solid Mechanics, 010141 Bucharest, Romania.
Sensors (Basel). 2020 Apr 23;20(8):2393. doi: 10.3390/s20082393.
The interaction between humans and an NAO robot using deep convolutional neural networks (CNNs) is presented in this paper, based on an innovative end-to-end pipeline that applies two optimized CNNs, one for face recognition (FR) and one for facial expression recognition (FER), in order to achieve real-time inference speed for the entire process. Two different models are considered for FR: one known to be very accurate but with low inference speed (faster region-based convolutional neural network) and one less accurate but with high inference speed (single shot detector convolutional neural network). For emotion recognition, transfer learning and fine-tuning of three CNN models (VGG, Inception V3, and ResNet) were used. The overall results show that the single shot detector convolutional neural network (SSD CNN) and faster region-based convolutional neural network (Faster R-CNN) models for face detection achieve nearly the same accuracy: 97.8% for Faster R-CNN on PASCAL visual object classes (PASCAL VOC) evaluation metrics and 97.42% for SSD Inception. In terms of FER, ResNet obtained the highest training accuracy (90.14%), while the visual geometry group (VGG) network reached 87% and Inception V3 reached 81%. The results show improvements of over 10% when using two serialized CNNs instead of the FER CNN alone, while the recent optimization method called rectified adaptive moment estimation (RAdam) led to better generalization and an accuracy improvement of 3%-4% on each emotion-recognition CNN.
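The abstract credits part of the accuracy gain to the RAdam optimizer, which rectifies Adam's adaptive learning rate during the first few steps when the second-moment estimate has high variance. As a minimal sketch of that update rule (not the paper's implementation; hyperparameters are standard RAdam defaults assumed here, and `radam_minimize` is an illustrative helper applied to a toy 1-D objective rather than a CNN):

```python
import math

def radam_minimize(grad_fn, theta, lr=0.01, beta1=0.9, beta2=0.999,
                   eps=1e-8, steps=2000):
    """Minimize a 1-D objective with the RAdam update rule (Liu et al., 2019)."""
    m, v = 0.0, 0.0
    rho_inf = 2.0 / (1.0 - beta2) - 1.0  # maximum length of the approximated SMA
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g          # first moment (momentum)
        v = beta2 * v + (1 - beta2) * g * g      # second moment
        m_hat = m / (1 - beta1 ** t)             # bias-corrected momentum
        rho_t = rho_inf - 2 * t * beta2 ** t / (1 - beta2 ** t)
        if rho_t > 4:
            # Variance is tractable: apply the adaptive step with rectification.
            v_hat = math.sqrt(v / (1 - beta2 ** t))
            r_t = math.sqrt(((rho_t - 4) * (rho_t - 2) * rho_inf) /
                            ((rho_inf - 4) * (rho_inf - 2) * rho_t))
            theta -= lr * r_t * m_hat / (v_hat + eps)
        else:
            # Early steps: fall back to plain momentum SGD (no adaptive scaling).
            theta -= lr * m_hat
    return theta

# Toy example: minimize (x - 3)^2, whose gradient is 2(x - 3).
x = radam_minimize(lambda x: 2 * (x - 3), theta=0.0)
```

With the default `beta2=0.999`, the rectified branch only activates after roughly the fifth step, which is the warmup-like behavior that RAdam is known for.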