Shah Miraj, Cooper David G, Cao Houwei, Gur Ruben C, Nenkova Ani, Verma Ragini
Section of Biomedical Image Analysis, Department of Radiology, University of Pennsylvania, Philadelphia, PA 19104, United States.
Department of Psychiatry, University of Pennsylvania, Philadelphia, PA 19104, United States.
Int Conf Affect Comput Intell Interact Workshops. 2013 Sep;2013:49-54. doi: 10.1109/ACII.2013.15.
Automatic recognition of emotion from facial expressions in the presence of speech poses a unique challenge: talking reveals clues to the affective state of the speaker but distorts the canonical expression of emotion on the face. We introduce a corpus of acted emotion expression in which speech is either present (talking) or absent (silent). The corpus is uniquely suited for analysis of the interplay between the two conditions. We use a multimodal decision-level fusion classifier to combine models of emotion from talking and silent faces, as well as from audio, to recognize five basic emotions: anger, disgust, fear, happiness, and sadness. Our results strongly indicate that emotion prediction from facial action unit features is less accurate when the person is talking. Modeling talking and silent expressions separately and fusing the two models greatly improves the accuracy of prediction in the talking setting. The advantages are most pronounced when the silent and talking face models are fused with predictions from audio features. In this multimodal prediction, both the combination of modalities and the separate models of talking and silent facial expressions of emotion contribute to the improvement.
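To make the decision-level fusion step concrete, the sketch below shows one common way such a fusion can be implemented: each modality's classifier (silent face, talking face, audio) emits a posterior distribution over the five emotions, and the fused decision is a weighted average of those distributions. The paper does not publish its fusion weights or classifier internals; the uniform weighting, the `fuse_decisions` helper, and the probability-vector interface are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

# Emotion label order is assumed for illustration.
EMOTIONS = ["anger", "disgust", "fear", "happiness", "sadness"]

def fuse_decisions(prob_silent, prob_talking, prob_audio,
                   weights=(1 / 3, 1 / 3, 1 / 3)):
    """Decision-level fusion by weighted averaging of class posteriors.

    Each argument is a length-5 array of posterior probabilities over
    EMOTIONS, produced by a separately trained per-modality model.
    Returns the fused label and the fused distribution.
    """
    probs = np.stack([prob_silent, prob_talking, prob_audio])  # shape (3, 5)
    fused = np.average(probs, axis=0, weights=weights)
    fused /= fused.sum()  # renormalize against numeric drift
    return EMOTIONS[int(np.argmax(fused))], fused

# Example with made-up posteriors: the two face models disagree,
# and the audio model breaks the tie in favor of "happiness".
silent = np.array([0.10, 0.05, 0.15, 0.60, 0.10])   # silent-face model
talking = np.array([0.30, 0.10, 0.10, 0.35, 0.15])  # talking-face model
audio = np.array([0.15, 0.05, 0.05, 0.65, 0.10])    # audio model

label, dist = fuse_decisions(silent, talking, audio)
print(label, np.round(dist, 3))
```

Because fusion happens at the decision level rather than the feature level, each modality model can be trained independently on its own condition (silent faces, talking faces, or audio), which is what allows the separate talking and silent face models described in the abstract to be combined without retraining.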