Zhao Xuefeng, Wang Yuxiang, Zhong Zhaoman
School of Computer Engineering, Jiangsu Ocean University, Lianyungang 222005, China.
Sensors (Basel). 2025 Apr 17;25(8):2553. doi: 10.3390/s25082553.
The rapid development of social media has driven demand for opinion mining and sentiment analysis on multimodal samples. As a fine-grained task within multimodal sentiment analysis, aspect-based multimodal sentiment analysis (ABMSA) determines the sentiment polarity of aspect-level targets accurately and efficiently. However, traditional ABMSA methods often perform suboptimally on social media samples, because the images in these samples typically contain embedded text that conventional models overlook, yet such text influences sentiment judgment. To address this issue, we propose a text-in-image enhanced self-supervised alignment model (TESAM) that accounts for multimodal information more comprehensively. Specifically, we employ optical character recognition (OCR) to extract the embedded text from images and, on the principle that text-in-image is an integral part of the visual modality, fuse it with visual features to obtain more comprehensive image representations. In addition, we incorporate aspect words to guide the model in disregarding irrelevant semantic features, thereby reducing noise interference. Furthermore, to mitigate the semantic gap between modalities, we pre-train the feature extraction module with self-supervised alignment: during this stage, the unimodal semantic embeddings of the two modalities are aligned by computing errors with Euclidean distance and cosine similarity. Experimental results show that TESAM achieves strong performance on three ABMSA benchmarks, validating the rationale and effectiveness of the proposed improvements.
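To make the alignment objective concrete, the following is a minimal sketch (not the authors' released code) of a pre-training loss that penalizes misaligned text and image embeddings with both a Euclidean-distance term and a cosine-similarity term, as the abstract describes; the equal weighting of the two terms, the batch-mean reduction, and the embedding dimension are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def alignment_loss(text_emb: torch.Tensor, image_emb: torch.Tensor) -> torch.Tensor:
    """Self-supervised alignment error between paired unimodal embeddings.

    Combines a Euclidean-distance term with a cosine-similarity term;
    the equal weighting and mean reduction are assumptions.
    """
    # Euclidean distance between each paired text and image embedding.
    euclidean = torch.norm(text_emb - image_emb, p=2, dim=-1)
    # Cosine term: 1 - cos(theta) vanishes when the embeddings point the same way.
    cosine = 1.0 - F.cosine_similarity(text_emb, image_emb, dim=-1)
    # Average over the batch (assumed reduction).
    return (euclidean + cosine).mean()

# Usage: align a batch of 32 paired 512-d embeddings (dimension assumed).
text_emb = torch.randn(32, 512, requires_grad=True)
image_emb = torch.randn(32, 512, requires_grad=True)
loss = alignment_loss(text_emb, image_emb)
loss.backward()  # gradients would flow to both unimodal encoders during pre-training
```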