Nimitsurachat Peranut, Washington Peter
Institute for Computational and Mathematical Engineering (ICME), Stanford University, Stanford, CA 94305, USA.
Information and Computer Sciences, University of Hawai'i at Mānoa, Honolulu, HI 96822, USA.
AI (Basel). 2024 Mar;5(1):195-207. doi: 10.3390/ai5010011. Epub 2024 Jan 17.
Emotion recognition models using audio input data can enable the development of interactive systems with applications in mental healthcare, marketing, gaming, and social media analysis. While the field of affective computing using audio data is rich, a major barrier to achieving consistently high-performing models is the paucity of available training labels. Self-supervised learning (SSL) is a family of methods that can learn despite a scarcity of supervised labels by predicting properties of the data itself. To understand the utility of self-supervised learning for audio-based emotion recognition, we applied self-supervised pre-training to the classification of emotions from the acoustic data of the CMU Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI) dataset. Unlike prior papers that have experimented with raw acoustic data, our technique is applied to encoded acoustic data with 74 parameters of distinctive audio features at discrete timesteps. Our model is first pre-trained to reconstruct the randomly masked timesteps of the acoustic data. The pre-trained model is then fine-tuned using a small sample of annotated data. The performance of the final model is evaluated via overall mean absolute error (MAE), MAE per emotion, overall four-class accuracy, and four-class accuracy per emotion. These metrics are compared against a baseline deep learning model with an identical backbone architecture. We find that self-supervised learning consistently improves the performance of the model across all metrics, especially when the number of annotated data points in the fine-tuning step is small. Furthermore, we quantify the behavior of the self-supervised model and its convergence as the amount of annotated data increases. This work characterizes the utility of self-supervised learning for affective computing, demonstrating that self-supervised learning is most useful when the number of training examples is small and that the effect is most pronounced for emotions which are easier to classify, such as happy, sad, and angry. It further demonstrates that self-supervised learning still improves performance when applied to embedded feature representations rather than to the raw input space, as in the traditional pre-training approach.
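To illustrate the workflow the abstract describes, the following is a minimal sketch (in PyTorch) of masked-timestep self-supervised pre-training on 74-dimensional engineered acoustic features, followed by fine-tuning of the same backbone for emotion intensity prediction. All module names, architecture choices, and hyperparameters here are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch of masked-timestep SSL pre-training and fine-tuning.
# The backbone, sizes, and hyperparameters are assumptions, not the paper's exact model.
import torch
import torch.nn as nn

FEAT_DIM = 74          # engineered acoustic features per timestep (per the abstract)
SEQ_LEN = 100          # illustrative sequence length
MASK_PROB = 0.15       # illustrative masking rate

class Backbone(nn.Module):
    """Shared encoder used for both pre-training and fine-tuning."""
    def __init__(self, d_model=128, n_heads=4, n_layers=2):
        super().__init__()
        self.proj = nn.Linear(FEAT_DIM, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, x):                      # x: (batch, time, FEAT_DIM)
        return self.encoder(self.proj(x))      # (batch, time, d_model)

class MaskedReconstructor(nn.Module):
    """Self-supervised objective: reconstruct features at masked timesteps."""
    def __init__(self, backbone):
        super().__init__()
        self.backbone = backbone
        self.head = nn.Linear(128, FEAT_DIM)

    def forward(self, x, mask):                # mask: (batch, time) boolean
        x_masked = x.masked_fill(mask.unsqueeze(-1), 0.0)  # zero out masked steps
        recon = self.head(self.backbone(x_masked))
        # Loss is computed only at the masked positions.
        return nn.functional.l1_loss(recon[mask], x[mask])

class EmotionRegressor(nn.Module):
    """Fine-tuning head: predict per-emotion intensities from pooled features."""
    def __init__(self, backbone, n_emotions=6):   # 6 CMU-MOSEI emotions (assumed)
        super().__init__()
        self.backbone = backbone
        self.head = nn.Linear(128, n_emotions)

    def forward(self, x):
        return self.head(self.backbone(x).mean(dim=1))     # mean-pool over time

# Toy usage with random data in place of encoded CMU-MOSEI features.
backbone = Backbone()
pretrainer = MaskedReconstructor(backbone)
x = torch.randn(8, SEQ_LEN, FEAT_DIM)
mask = torch.rand(8, SEQ_LEN) < MASK_PROB
loss = pretrainer(x, mask)                      # self-supervised pre-training loss
loss.backward()

finetuner = EmotionRegressor(backbone)          # reuse the pre-trained backbone
y_hat = finetuner(x)                            # (8, n_emotions) intensity predictions
```

The key design point in the abstract is that the masking and reconstruction operate on the encoded 74-parameter feature space rather than on raw audio, so the same backbone can be reused directly for the supervised fine-tuning step.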