AI R&D Team, Kakao Enterprise, 235, Pangyoyeok-ro, Bundang-gu, Seongnam-si, Gyeonggi-do 13494, Korea.
School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology, 123 Cheomdan-gwagiro, Buk-gu, Gwangju 61005, Korea.
Sensors (Basel). 2020 May 4;20(9):2614. doi: 10.3390/s20092614.
In this paper, we propose a novel emotion recognition method based on the underlying emotional characteristics extracted by a conditional adversarial auto-encoder (CAAE), in which both acoustic and lexical features are used as inputs. The acoustic features are generated by calculating statistical functionals of low-level descriptors and by a deep neural network (DNN). These acoustic features are concatenated with three types of lexical features extracted from the text: a sparse representation, a distributed representation, and affective lexicon-based dimensions. The CAAE yields two-dimensional latent representations, analogous to vectors in the valence-arousal space, which can be mapped directly onto the emotion classes without the need for a sophisticated classifier. In contrast to a previous attempt that applied a CAAE to acoustic features alone, the proposed approach improves emotion recognition performance because the combined acoustic and lexical features provide sufficient discriminative power. Experimental results on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) corpus showed that our method outperformed the previously reported best results on the same corpus, achieving an unweighted average recall of 76.72%.
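For concreteness, the sketch below illustrates the kind of conditional adversarial auto-encoder the abstract describes: an encoder compresses the concatenated acoustic and lexical feature vector into a two-dimensional latent code, a decoder reconstructs the input from that code, and a discriminator conditioned on the emotion label adversarially regularizes the latent space. This is a minimal sketch under assumptions: the framework (PyTorch), the layer sizes, and the feature dimensions (ACOUSTIC_DIM, LEXICAL_DIM) are illustrative choices, not the authors' actual configuration.

```python
# Minimal CAAE sketch with a 2-D latent space; all dimensions are assumptions.
import torch
import torch.nn as nn

ACOUSTIC_DIM = 1582   # e.g., statistical functionals of low-level descriptors (assumed)
LEXICAL_DIM = 300     # e.g., a distributed text representation (assumed)
NUM_CLASSES = 4       # emotion classes on IEMOCAP (assumed subset)
LATENT_DIM = 2        # 2-D latent code, akin to the valence-arousal plane

class Encoder(nn.Module):
    """Maps the concatenated acoustic + lexical features to a 2-D latent code."""
    def __init__(self, in_dim=ACOUSTIC_DIM + LEXICAL_DIM):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, LATENT_DIM),
        )
    def forward(self, x):
        return self.net(x)

class Decoder(nn.Module):
    """Reconstructs the input features from the latent code."""
    def __init__(self, out_dim=ACOUSTIC_DIM + LEXICAL_DIM):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(LATENT_DIM, 256), nn.ReLU(),
            nn.Linear(256, out_dim),
        )
    def forward(self, z):
        return self.net(z)

class Discriminator(nn.Module):
    """Scores whether a latent code, conditioned on the one-hot emotion
    label, comes from the imposed prior or from the encoder."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(LATENT_DIM + NUM_CLASSES, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )
    def forward(self, z, y_onehot):
        return self.net(torch.cat([z, y_onehot], dim=1))

# One illustrative forward pass on a random batch:
x = torch.randn(8, ACOUSTIC_DIM + LEXICAL_DIM)   # concatenated features
y = nn.functional.one_hot(
    torch.randint(0, NUM_CLASSES, (8,)), NUM_CLASSES).float()
enc, dec, disc = Encoder(), Decoder(), Discriminator()
z = enc(x)                                # 2-D emotional representation
recon_loss = nn.functional.mse_loss(dec(z), x)   # reconstruction term
adv_score = disc(z, y)                    # adversarial term on the latent
```

Because the latent code is already two-dimensional and organized like the valence-arousal plane, class assignment can be as simple as a nearest-centroid rule in that plane, consistent with the abstract's claim that no sophisticated classifier is needed.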