Vamsidhar D, Desai Parth, Shahade Aniket K, Patil Shruti, Deshmukh Priyanka V
Symbiosis Institute of Technology, Pune Campus, Symbiosis International (Deemed University), Pune, India.
Sci Rep. 2025 Jul 15;15(1):25440. doi: 10.1038/s41598-025-09000-3.
This paper presents a new architecture for multimodal sentiment analysis that exploits hierarchical cross-modal attention mechanisms together with two parallel pathways for audio analysis. Traditional sentiment analysis approaches rely mainly on text data, which can be insufficient because valuable sentiment information may reside in images and audio. To address this issue, the model provides a unified framework that integrates three modalities (text, image, audio) using a BERT text encoder, a ResNet50 visual feature extractor, and a hybrid CNN-Wav2Vec2.0 pipeline for audio representation. Its main innovation is a dual audio pathway augmented with a dynamic gating module and a cross-modal self-attention layer that enables fine-grained interaction among modalities. The model achieves state-of-the-art performance on several benchmarks, outperforming recent approaches such as CLIP, MISA, and MSFNet. In particular, the results show improved classification accuracy when modality data are missing or noisy. The system's robustness and reliability are validated through an exhaustive analysis using precision, recall, F1-score, and confusion matrices. In addition, the architecture demonstrates modular scalability and adaptability across domains, making it well suited for applications in healthcare, social media, and customer service. By providing a framework for developing affective AI systems that can decode human emotion from intricate multimodal features, the study lays the groundwork for future research, including real-time processing, domain-specific adaptation, and the extension of the analysis to multi-channel sensor input combining physiological and temporal data streams.
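To make the described fusion design concrete, the following is a minimal PyTorch sketch of the components named in the abstract: a dual audio pathway merged by a dynamic gate, and a cross-modal self-attention layer over the three modality embeddings. It is not the authors' implementation; the pretrained encoders (BERT, ResNet50, Wav2Vec2.0) are replaced by linear projections over assumed precomputed features, and all dimensions, module names, and the gating formulation are illustrative assumptions.

```python
# Hedged sketch of the fusion stage only; encoders are stand-in projections.
import torch
import torch.nn as nn


class DualAudioPathway(nn.Module):
    """Two parallel audio branches (CNN-style and Wav2Vec2-style features)
    merged by a learned dynamic gate (assumed formulation)."""

    def __init__(self, cnn_dim=128, w2v_dim=768, d_model=256):
        super().__init__()
        self.cnn_proj = nn.Linear(cnn_dim, d_model)   # spectrogram-CNN branch
        self.w2v_proj = nn.Linear(w2v_dim, d_model)   # Wav2Vec2.0 branch
        self.gate = nn.Sequential(nn.Linear(2 * d_model, d_model), nn.Sigmoid())

    def forward(self, cnn_feat, w2v_feat):
        a, b = self.cnn_proj(cnn_feat), self.w2v_proj(w2v_feat)
        g = self.gate(torch.cat([a, b], dim=-1))      # per-dimension mixing weights
        return g * a + (1.0 - g) * b


class CrossModalFusion(nn.Module):
    """Stacks the three modality tokens and applies multi-head self-attention
    to model fine-grained cross-modal interactions."""

    def __init__(self, text_dim=768, image_dim=2048, d_model=256, n_heads=4, n_classes=3):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, d_model)    # stands in for BERT output
        self.image_proj = nn.Linear(image_dim, d_model)  # stands in for pooled ResNet50 features
        self.audio = DualAudioPathway(d_model=d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, text_feat, image_feat, cnn_audio, w2v_audio):
        tokens = torch.stack(
            [self.text_proj(text_feat),
             self.image_proj(image_feat),
             self.audio(cnn_audio, w2v_audio)],
            dim=1,                                       # (batch, 3 modalities, d_model)
        )
        fused, _ = self.attn(tokens, tokens, tokens)     # cross-modal self-attention
        return self.classifier(fused.mean(dim=1))        # pooled sentiment logits


# Toy forward pass with random "precomputed" features.
model = CrossModalFusion()
logits = model(torch.randn(2, 768), torch.randn(2, 2048),
               torch.randn(2, 128), torch.randn(2, 768))
print(logits.shape)  # torch.Size([2, 3])
```

The gate here is a simple sigmoid mixer between the two audio branches, one plausible reading of "dynamic gating module"; the paper's exact mechanism may differ.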