Keinert Marie, Pistrosch Simon, Mallol-Ragolta Adria, Schuller Björn W, Berking Matthias
Department of Clinical Psychology and Psychotherapy, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany.
Chair of Health Informatics, Technical University of Munich, Munich, Germany.
J Med Internet Res. 2025 Jul 2;27:e68942. doi: 10.2196/68942.
The development of automatic emotion recognition models from smartphone videos is a crucial step toward the dissemination of psychotherapeutic app interventions that encourage emotional expressions. Existing models focus mainly on the 6 basic emotions while neglecting other therapeutically relevant emotions. To support this research, we introduce the novel Stress Reduction Training Through the Recognition of Emotions Wizard-of-Oz (STREs WoZ) dataset, which contains facial videos of 16 distinct, therapeutically relevant emotions.
This study aimed to develop deep learning-based automatic facial emotion recognition (FER) models for binary (positive vs negative) and multiclass emotion classification tasks, assess the models' performance, and validate them by comparing the models with human observers.
The STREs WoZ dataset contains 14,412 facial videos of 63 individuals displaying the 16 emotions. The selfie-style videos were recorded with front-facing smartphone cameras during a stress reduction training session in a nonconstrained laboratory setting. Automatic FER models for binary and multiclass emotion classification were trained on the STREs WoZ dataset using both appearance and deep-learned features. The appearance features were based on the Facial Action Coding System and extracted with OpenFace. The deep-learned features were obtained through a ResNet50 model. For our deep learning models, we used the appearance features, the deep-learned features, and their concatenation as inputs. We used 3 recurrent neural network (RNN)-based architectures: RNN-convolution, RNN-attention, and RNN-average networks. For validation, 3 human observers were also trained in binary and multiclass emotion recognition. Both the automatic FER models and the human observers then classified a test set of 3018 facial emotion videos covering the 16 emotions. Performance was assessed with unweighted average recall (UAR) and accuracy.
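To make the RNN-attention architecture concrete, the following is a minimal sketch in PyTorch (not the authors' code) of an attention-pooled recurrent classifier over per-frame features. The 35-dimensional input assumes OpenFace's action-unit outputs (17 intensity values plus 18 occurrence flags); the hidden size, class count, and all identifiers are illustrative assumptions, as the abstract does not specify the architecture's dimensions.

    # Minimal sketch of an RNN-attention classifier over per-frame facial
    # features; all dimensions are illustrative assumptions.
    import torch
    import torch.nn as nn

    class RNNAttentionFER(nn.Module):
        def __init__(self, feat_dim=35, hidden_dim=128, num_classes=16):
            super().__init__()
            self.rnn = nn.GRU(feat_dim, hidden_dim, batch_first=True)
            self.attn = nn.Linear(hidden_dim, 1)   # one score per frame
            self.head = nn.Linear(hidden_dim, num_classes)

        def forward(self, x):                      # x: (batch, frames, feat_dim)
            h, _ = self.rnn(x)                     # (batch, frames, hidden_dim)
            w = torch.softmax(self.attn(h), dim=1) # attention weights over frames
            pooled = (w * h).sum(dim=1)            # weighted temporal pooling
            return self.head(pooled)               # class logits

    model = RNNAttentionFER()
    logits = model(torch.randn(8, 150, 35))        # e.g. 8 clips of 150 frames

Under the same assumptions, the RNN-convolution and RNN-average variants would presumably replace the attention pooling with a temporal convolution or a plain mean over the recurrent outputs, respectively.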
In both tasks, models using appearance features outperformed those using deep-learned features as well as those combining both feature types; the attention network with appearance features was the best-performing model. It achieved a UAR of 92.9% in the binary classification task, and accuracy values ranged from 59.0% to 90.0% in the multiclass classification task. Human performance was comparable to that of the automatic FER model in the binary classification task, with a UAR of 91.0%, and superior in the multiclass classification task, with accuracy values ranging from 87.4% to 99.8%.
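For reference, UAR (also known as balanced accuracy) is the unweighted mean of the per-class recalls, so it is insensitive to class imbalance; chance level is 50% in the binary task and 6.25% (1/16) in the 16-class task:

    \mathrm{UAR} = \frac{1}{C} \sum_{c=1}^{C} \frac{TP_c}{TP_c + FN_c}

where C is the number of classes and TP_c and FN_c are the true positives and false negatives for class c.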
Future studies are needed to enhance the performance of automatic FER models for practical use in psychotherapeutic apps. Nevertheless, this study represents an important first step toward advancing emotion-focused psychotherapeutic interventions via smartphone apps.