School of Mathematics, Statistics and Computer Science, University of KwaZulu-Natal, Durban, 4001, South Africa.
Sci Rep. 2024 Jun 7;14(1):13126. doi: 10.1038/s41598-024-63776-4.
In human-computer interaction systems, speech emotion recognition (SER) plays a crucial role because it enables computers to understand and react to users' emotions. Historically, SER has relied heavily on acoustic features extracted from speech signals. Recent developments in deep learning and computer vision, however, have made it possible to use visual representations of speech to enhance SER performance. This work proposes a novel method for improving speech emotion recognition using a lightweight Vision Transformer (ViT) model. We leverage the ViT model's ability to capture spatial dependencies and high-level features in mel spectrogram images, which serve as effective indicators of emotional states. To evaluate the proposed approach, we conduct comprehensive experiments on two benchmark speech emotion datasets, the Toronto Emotional Speech Set (TESS) and the Berlin Emotional Database (EMODB). The results demonstrate a considerable improvement in speech emotion recognition accuracy and attest to the method's generalizability, with accuracies of 98% on TESS, 91% on EMODB, and 93% on the combined TESS-EMODB set. The comparative experiments show that the non-overlapping patch-based feature extraction method substantially improves recognition performance. Our research indicates the potential of integrating vision transformer models into SER systems, outperforming other state-of-the-art techniques and opening up fresh opportunities for real-world applications requiring accurate emotion recognition from speech.
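The pipeline the abstract describes, converting speech to a mel spectrogram image and classifying it with a lightweight ViT built on non-overlapping patches, can be sketched as follows. This is a minimal illustrative sketch: the patch size, embedding dimension, depth, head count, and seven-class output are assumptions, not the paper's reported configuration.

```python
# Hypothetical sketch: mel spectrogram -> lightweight ViT emotion classifier.
# Hyperparameters below are illustrative assumptions, not the authors' exact setup.
import librosa
import numpy as np
import torch
import torch.nn as nn

def mel_spectrogram_image(path, sr=16000, n_mels=128, frames=128):
    """Load audio and produce a fixed-size log-mel spectrogram 'image'."""
    y, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    logmel = librosa.power_to_db(mel, ref=np.max)
    # Pad or crop the time axis so every utterance yields an n_mels x frames input.
    if logmel.shape[1] < frames:
        logmel = np.pad(logmel, ((0, 0), (0, frames - logmel.shape[1])))
    return torch.from_numpy(logmel[:, :frames]).float().unsqueeze(0)  # (1, 128, 128)

class LightweightViT(nn.Module):
    """Minimal ViT: non-overlapping patches -> transformer encoder -> class head."""
    def __init__(self, img=128, patch=16, dim=192, depth=4, heads=3, n_classes=7):
        super().__init__()
        n_patches = (img // patch) ** 2
        # A strided conv cuts the spectrogram into non-overlapping patches
        # and linearly embeds each one.
        self.embed = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, n_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, n_classes)

    def forward(self, x):                                # x: (B, 1, 128, 128)
        x = self.embed(x).flatten(2).transpose(1, 2)     # (B, 64, dim)
        x = torch.cat([self.cls.expand(x.size(0), -1, -1), x], dim=1) + self.pos
        x = self.encoder(x)
        return self.head(x[:, 0])                        # classify from the CLS token

model = LightweightViT()
spec = torch.randn(1, 1, 128, 128)  # stand-in for a real spectrogram batch
logits = model(spec)                # (1, 7) emotion scores
```

The strided convolution in `embed` is the standard way to realize the non-overlapping patch extraction the abstract highlights: each 16x16 tile of the spectrogram becomes one token, so no spectrogram region is counted twice.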