IEEE Trans Pattern Anal Mach Intell. 2023 Sep;45(9):10745-10759. doi: 10.1109/TPAMI.2023.3263585. Epub 2023 Aug 7.
Recent advances in transformer-based architectures have shown promise in several machine learning tasks. In the audio domain, such architectures have been successfully utilised in the field of speech emotion recognition (SER). However, existing works have not evaluated the influence of model size and pre-training data on downstream performance, and have shown limited attention to generalisation, robustness, fairness, and efficiency. The present contribution conducts a thorough analysis of these aspects on several pre-trained variants of wav2vec 2.0 and HuBERT that we fine-tuned on the dimensions arousal, dominance, and valence of MSP-Podcast, while additionally using IEMOCAP and MOSI to test cross-corpus generalisation. To the best of our knowledge, we obtain the top performance for valence prediction without use of explicit linguistic information, with a concordance correlation coefficient (CCC) of .638 on MSP-Podcast. Our investigations reveal that transformer-based architectures are more robust compared to a CNN-based baseline and fair with respect to gender groups, but not towards individual speakers. Finally, we show that their success on valence is based on implicit linguistic information, which explains why they perform on-par with recent multimodal approaches that explicitly utilise textual information. To make our findings reproducible, we release the best performing model to the community.
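The concordance correlation coefficient (CCC) reported above is the standard evaluation metric for dimensional SER. As a minimal sketch (not the authors' evaluation code), Lin's CCC can be computed from predictions and gold labels as follows; the function name and NumPy-based implementation are illustrative assumptions:

```python
import numpy as np

def concordance_cc(y_true, y_pred):
    """Lin's concordance correlation coefficient.

    Unlike Pearson correlation, CCC also penalises differences in
    mean and variance between predictions and gold labels, so a
    systematically shifted or scaled prediction scores lower.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mean_t, mean_p = y_true.mean(), y_pred.mean()
    var_t, var_p = y_true.var(), y_pred.var()
    # Population covariance between the two sequences
    cov = ((y_true - mean_t) * (y_pred - mean_p)).mean()
    return 2 * cov / (var_t + var_p + (mean_t - mean_p) ** 2)
```

For identical sequences the CCC is 1.0, while a constant offset between predictions and labels lowers the score even though the Pearson correlation stays at 1, which is why CCC is preferred for regression on arousal, dominance, and valence.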