Department of Computer Science and Engineering, Chittagong University of Engineering and Technology, Chittagong 4349, Bangladesh.
School of Computing, Engineering and Intelligent Systems, Ulster University, Belfast BT15 1AP, UK.
Sensors (Basel). 2024 Sep 10;24(18):5862. doi: 10.3390/s24185862.
Multimodal emotion classification (MEC) involves analyzing and identifying human emotions by integrating data from multiple sources, such as audio, video, and text. This approach leverages the complementary strengths of each modality to enhance the accuracy and robustness of emotion recognition systems. However, one significant challenge is effectively integrating these diverse data sources, each with its own characteristics and noise levels. Additionally, the scarcity of large, annotated multimodal datasets in Bangla limits the training and evaluation of models. In this work, we introduce MAViT-Bangla (Multimodal Audio Video Text Bangla dataset), a pioneering multimodal Bangla dataset. Comprising 1002 samples across audio, video, and text modalities, it is a unique resource for emotion recognition studies in the Bangla language, featuring emotion categories such as anger, fear, joy, and sadness. Additionally, we developed AVaTER, a framework for audio, video, and textual emotion recognition that employs a cross-modal attention mechanism among unimodal features. This mechanism fosters the interaction and fusion of features from different modalities, enhancing the model's ability to capture nuanced emotional cues. The proposed approach achieved an F1-score of 0.64, a significant improvement over unimodal methods.
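To make the fusion step concrete, the following is a minimal PyTorch sketch of a cross-modal attention block between two unimodal feature streams, in which features from one modality attend over another before residual fusion. The layer sizes, sequence lengths, mean pooling, and four-class head are illustrative assumptions and do not reproduce the AVaTER architecture itself.

import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Lets queries from one modality attend over keys/values from another."""
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query_feats: torch.Tensor, context_feats: torch.Tensor) -> torch.Tensor:
        # query_feats:   (batch, T_q, dim) from modality A (e.g., text)
        # context_feats: (batch, T_c, dim) from modality B (e.g., audio or video)
        attended, _ = self.attn(query_feats, context_feats, context_feats)
        return self.norm(query_feats + attended)  # residual fusion of the two streams

# Hypothetical usage: fuse text queries with audio context, then pool for classification.
batch, t_text, t_audio, dim = 2, 16, 50, 256
text = torch.randn(batch, t_text, dim)
audio = torch.randn(batch, t_audio, dim)
fusion = CrossModalAttention(dim)
fused = fusion(text, audio)                     # (2, 16, 256)
logits = nn.Linear(dim, 4)(fused.mean(dim=1))   # 4 emotion classes: anger, fear, joy, sadness

In practice, such a block is often applied symmetrically (each modality attending over the others, with the outputs concatenated before the classifier); whether AVaTER does so is not stated in the abstract.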