Department of Robotics and Mechatronics, Romanian Academy Institute of Solid Mechanics, 010141 Bucharest, Romania.
Sensors (Basel). 2020 Apr 23;20(8):2393. doi: 10.3390/s20082393.
The interaction between humans and an NAO robot using deep convolutional neural networks (CNNs) is presented in this paper, based on an innovative end-to-end pipeline that applies two optimized CNNs, one for face recognition (FR) and one for facial expression recognition (FER), in order to achieve real-time inference speed for the entire process. Two different models are considered for FR: one known to be very accurate but with low inference speed (faster region-based convolutional neural network) and one less accurate but with high inference speed (single shot detector convolutional neural network). For emotion recognition, transfer learning and fine-tuning of three CNN models (VGG, Inception V3, and ResNet) were used. The overall results show that the single shot detector convolutional neural network (SSD CNN) and faster region-based convolutional neural network (Faster R-CNN) models for face detection achieve nearly the same accuracy: 97.8% for Faster R-CNN on PASCAL visual object classes (PASCAL VOC) evaluation metrics and 97.42% for SSD Inception. In terms of FER, ResNet obtained the highest training accuracy (90.14%), while the visual geometry group (VGG) network reached 87% and Inception V3 reached 81%. The results show improvements of over 10% when using two serialized CNNs instead of the FER CNN alone, while the recent optimization method called rectified adaptive moment estimation (RAdam) led to better generalization and an accuracy improvement of 3%-4% on each emotion-recognition CNN.
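The abstract credits part of the accuracy gain to the RAdam optimizer, which rectifies Adam's adaptive learning rate during the first few steps when the second-moment estimate has high variance. As a minimal sketch of that update rule (not the paper's implementation; hyperparameters are standard RAdam defaults assumed here, and `radam_minimize` is an illustrative helper applied to a toy 1-D objective rather than a CNN):

```python
import math

def radam_minimize(grad_fn, theta, lr=0.01, beta1=0.9, beta2=0.999,
                   eps=1e-8, steps=2000):
    """Minimize a 1-D objective with the RAdam update rule (Liu et al., 2019)."""
    m, v = 0.0, 0.0
    rho_inf = 2.0 / (1.0 - beta2) - 1.0  # maximum length of the approximated SMA
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g          # first moment (momentum)
        v = beta2 * v + (1 - beta2) * g * g      # second moment
        m_hat = m / (1 - beta1 ** t)             # bias-corrected momentum
        rho_t = rho_inf - 2 * t * beta2 ** t / (1 - beta2 ** t)
        if rho_t > 4:
            # Variance is tractable: apply the adaptive step with rectification.
            v_hat = math.sqrt(v / (1 - beta2 ** t))
            r_t = math.sqrt(((rho_t - 4) * (rho_t - 2) * rho_inf) /
                            ((rho_inf - 4) * (rho_inf - 2) * rho_t))
            theta -= lr * r_t * m_hat / (v_hat + eps)
        else:
            # Early steps: fall back to plain momentum SGD (no adaptive scaling).
            theta -= lr * m_hat
    return theta

# Toy example: minimize (x - 3)^2, whose gradient is 2(x - 3).
x = radam_minimize(lambda x: 2 * (x - 3), theta=0.0)
```

With the default `beta2=0.999`, the rectified branch only activates after roughly the fifth step, which is the warmup-like behavior that RAdam is known for.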