Majeed Adil, Mujtaba Hasan
School of Computing, National University of Computer and Emerging Sciences, Islamabad, Pakistan.
PeerJ Comput Sci. 2025 May 1;11:e2861. doi: 10.7717/peerj-cs.2861. eCollection 2025.
Emotion detection is a critical component of human-computer interaction, and especially of affective computing and health screening. Integrating video, speech, and text information provides better coverage of basic and derived affective states and improves the estimation of verbal and non-verbal behavior. However, there is a lack of systematic resources and models for detecting emotions in low-resource languages such as Urdu. To this end, we propose the Urdu Multimodal Emotion Detection Network (UMEDNet), a new emotion detection model for Urdu that works with video, speech, and text inputs for a better understanding of emotion. To support UMEDNet, we created the Urdu Multimodal Emotion Detection (UMED) dataset, a seventeen-hour annotated dataset covering five basic emotions. To the best of our knowledge, the current study provides the first framework for multimodal emotion detection in the Urdu language and is extensible for further research. UMEDNet leverages state-of-the-art techniques for feature extraction across modalities: facial features are extracted from video with Multi-task Cascaded Convolutional Networks (MTCNN) and FaceNet, speech features with a fine-tuned Wav2Vec2 model, and text features with XLM-RoBERTa. These features are then projected into a common latent space to enable effective fusion of multimodal data and to enhance the accuracy of emotion prediction. The model demonstrates strong performance, achieving an overall accuracy of 85.27%, with precision, recall, and F1 scores at approximately comparable levels. Finally, we analyzed the impact of UMEDNet and found that integrating data across different modalities leads to better performance.
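As a rough illustration of the pipeline the abstract describes, the following is a minimal sketch of a three-branch encoder with projection into a shared latent space and simple concatenation fusion. It assumes the publicly available facenet-pytorch and Hugging Face Transformers packages; the model names, latent dimension, pooling choices, and concatenation fusion are illustrative assumptions, not the authors' exact implementation.

```python
# Hypothetical UMEDNet-style sketch: MTCNN + FaceNet for faces, Wav2Vec2 for
# speech, XLM-RoBERTa for text, each projected into a shared latent space and
# fused by concatenation before a five-class emotion head.
import torch
import torch.nn as nn
from facenet_pytorch import MTCNN, InceptionResnetV1      # face detection + embedding
from transformers import Wav2Vec2Model, XLMRobertaModel

class MultimodalEmotionNet(nn.Module):
    def __init__(self, latent_dim=256, num_emotions=5):
        super().__init__()
        # MTCNN is used offline to detect and crop faces from video frames.
        self.face_detector = MTCNN(image_size=160)
        # FaceNet (InceptionResnetV1) yields a 512-d embedding per face crop.
        self.face_encoder = InceptionResnetV1(pretrained="vggface2").eval()
        self.speech_encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
        self.text_encoder = XLMRobertaModel.from_pretrained("xlm-roberta-base")
        # Project each modality into a common latent space before fusion.
        self.video_proj = nn.Linear(512, latent_dim)
        self.speech_proj = nn.Linear(self.speech_encoder.config.hidden_size, latent_dim)
        self.text_proj = nn.Linear(self.text_encoder.config.hidden_size, latent_dim)
        self.classifier = nn.Linear(3 * latent_dim, num_emotions)

    def forward(self, face_crops, speech_values, text_ids, text_mask):
        # face_crops: pre-detected face tensors (B, 3, 160, 160) from MTCNN.
        v = self.video_proj(self.face_encoder(face_crops))
        # Mean-pool Wav2Vec2 frame features into one utterance vector.
        s = self.speech_proj(self.speech_encoder(speech_values).last_hidden_state.mean(dim=1))
        # Use the XLM-RoBERTa [CLS]-position representation for the transcript.
        t = self.text_proj(self.text_encoder(input_ids=text_ids,
                                             attention_mask=text_mask).last_hidden_state[:, 0])
        fused = torch.cat([v, s, t], dim=-1)               # simple late fusion by concatenation
        return self.classifier(fused)
```

Concatenation after per-modality projection is only one plausible fusion strategy; the key point conveyed by the abstract is that all three modalities are mapped into a common latent space before the emotion classifier.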