Artificial Intelligence Research Laboratory, Electronics and Telecommunications Research Institute, Daejeon 34129, Korea.
Sensors (Basel). 2021 Feb 24;21(5):1579. doi: 10.3390/s21051579.
Speech emotion recognition (SER) is a natural method of recognizing individual emotions in everyday life. To deploy SER models in real-world applications, some key challenges must be overcome, such as the lack of datasets tagged with emotion labels and the weak generalization of SER models to unseen target domains. This study proposes a multi-path and group-loss-based network (MPGLN) for SER to support multi-domain adaptation. The proposed model combines a temporal feature generator based on bidirectional long short-term memory with a feature extractor transferred from the pre-trained VGG-like audio classification model (VGGish), and it learns from multiple losses simultaneously, defined according to the association of emotion labels in the discrete and dimensional models. To evaluate the MPGLN SER on multi-cultural domain datasets, the Korean Emotional Speech Database (KESD), comprising KESDy18 and KESDy19, is constructed, and the English-language Interactive Emotional Dyadic Motion Capture database (IEMOCAP) is used. In the evaluations of multi-domain adaptation and domain generalization, the MPGLN SER improved the F1 score by 3.7% and 3.5%, respectively, over a baseline SER model that uses only a temporal feature generator. We show that the MPGLN SER efficiently supports multi-domain adaptation and reinforces model generalization.
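The idea of learning from several losses tied to both discrete and dimensional emotion labels can be illustrated with a minimal sketch. This is not the MPGLN loss itself, which the abstract does not specify; the emotion set, the arousal and valence groupings, the loss weights, and the names `group_probs` and `multi_loss` are all assumptions made for illustration. The sketch adds a cross-entropy term on the discrete emotion to cross-entropy terms on coarser groups obtained by summing class probabilities.

```python
import math

# Hypothetical label set and groupings (assumptions, not taken from the paper).
EMOTIONS = ["angry", "happy", "neutral", "sad"]
AROUSAL = {"angry": "high", "happy": "high", "neutral": "low", "sad": "low"}
VALENCE = {"angry": "negative", "happy": "non-negative",
           "neutral": "non-negative", "sad": "negative"}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def group_probs(probs, grouping):
    """Sum discrete-emotion probabilities into group probabilities."""
    out = {}
    for emotion, p in zip(EMOTIONS, probs):
        key = grouping[emotion]
        out[key] = out.get(key, 0.0) + p
    return out


def multi_loss(logits, true_emotion, weights=(1.0, 0.5, 0.5)):
    """Weighted sum of cross-entropy on the emotion and on its two groups.

    The weights are illustrative assumptions.
    """
    eps = 1e-12  # guards against log(0)
    probs = softmax(logits)
    loss_emotion = -math.log(probs[EMOTIONS.index(true_emotion)] + eps)
    loss_arousal = -math.log(
        group_probs(probs, AROUSAL)[AROUSAL[true_emotion]] + eps)
    loss_valence = -math.log(
        group_probs(probs, VALENCE)[VALENCE[true_emotion]] + eps)
    w_e, w_a, w_v = weights
    return w_e * loss_emotion + w_a * loss_arousal + w_v * loss_valence


if __name__ == "__main__":
    # True label "angry": a "happy" prediction shares the arousal group,
    # a "neutral" prediction shares neither group.
    print(multi_loss([5.0, 0.0, 0.0, 0.0], "angry"))
    print(multi_loss([0.0, 5.0, 0.0, 0.0], "angry"))
    print(multi_loss([0.0, 0.0, 5.0, 0.0], "angry"))
```

Under these assumptions, a wrong prediction that shares a group with the true label is penalized less than one that shares none, which is one plausible way that label associations could shape training.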