Cross-modal gated feature enhancement for multimodal emotion recognition in conversations.

Author information

Zhao Shiyun, Ren Jinchang, Zhou Xiaojuan

Affiliations

Department of Economics and Management, Suzhou Chien-Shiung Institute of Technology, Suzhou, 215411, China.

School of Computer Science, Guangdong Polytechnic Normal University, Guangzhou, 510665, China.

Publication information

Sci Rep. 2025 Aug 16;15(1):30004. doi: 10.1038/s41598-025-11989-6.

Abstract

Emotion recognition in conversations (ERC), which involves identifying the emotional state of each utterance within a dialogue, plays a vital role in developing empathetic artificial intelligence systems. In practical applications, such as video-based recruitment interviews, customer service, health monitoring, intelligent personal assistants, and online education, ERC can facilitate the analysis of emotional cues, improve decision-making processes, and enhance user interaction and satisfaction. Current multimodal emotion recognition research faces several challenges, such as ineffective emotional information extraction from single modalities, underused complementary features, and inter-modal redundancy. To tackle these issues, this paper introduces a cross-modal gated attention mechanism for emotion recognition. The method extracts and fuses visual, textual, and auditory features to enhance accuracy and stability. A cross-modal guided gating mechanism is designed to strengthen single-modality features and utilize a third modality to improve bimodal feature fusion, boosting multimodal feature representation. Furthermore, a cross-modal distillation loss function is proposed to reduce redundancy and improve feature discrimination. This function employs a dual-supervision mechanism with teacher and student models, ensuring consistency in single-modal, bimodal, and trimodal feature representations. Experimental results on the IEMOCAP and MELD datasets indicate that the proposed method achieves higher accuracy than existing approaches, with comparable F1 scores, highlighting its effectiveness in capturing multimodal dependencies and balancing modality contributions.
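The abstract describes the gating and distillation components only at a high level. As a rough sketch, not the authors' implementation, the PyTorch code below illustrates one plausible form of (i) a gate computed from a third guiding modality that re-weights a bimodal fusion and (ii) a temperature-scaled teacher-student consistency term; the class and function names, the residual mix, the feature dimensions, and the KL formulation are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalGatedFusion(nn.Module):
    """Hypothetical cross-modal guided gate: a third ("guiding") modality
    produces element-wise weights that modulate the fusion of the other two."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, x_a: torch.Tensor, x_b: torch.Tensor,
                x_guide: torch.Tensor) -> torch.Tensor:
        g = self.gate(x_guide)                            # gate in (0, 1), shape (batch, dim)
        fused = self.proj(torch.cat([x_a, x_b], dim=-1))  # bimodal fusion
        # Residual mix: the guide both re-weights and complements the fusion
        # (an assumption; the paper does not spell out this exact form).
        return g * fused + (1.0 - g) * x_guide


def distillation_consistency_loss(student: torch.Tensor,
                                  teacher: torch.Tensor,
                                  tau: float = 2.0) -> torch.Tensor:
    """Temperature-scaled KL term pushing student (e.g. unimodal/bimodal)
    features toward teacher (e.g. trimodal) features, in the spirit of the
    paper's cross-modal distillation loss (exact form is an assumption)."""
    s = F.log_softmax(student / tau, dim=-1)
    t = F.softmax(teacher.detach() / tau, dim=-1)  # teacher is not updated
    return F.kl_div(s, t, reduction="batchmean") * tau * tau


# Usage: text guides the fusion of audio and visual features (all 256-dim).
fuse = CrossModalGatedFusion(dim=256)
audio, visual, text = (torch.randn(8, 256) for _ in range(3))
bimodal = fuse(audio, visual, text)
loss = distillation_consistency_loss(bimodal, torch.randn(8, 256))
```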

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b522/12357892/b424ec708333/41598_2025_11989_Fig1_HTML.jpg
