VisdaNet: Visual Distillation and Attention Network for Multimodal Sentiment Classification

Authors

Hou Shangwu, Tuerhong Gulanbaier, Wushouer Mairidan

Affiliations

Xinjiang Multilingual Information Technology Laboratory, Xinjiang Multilingual Information Technology Research Center, College of Information Science and Engineering, Xinjiang University, Urumqi 830017, China.

Publication

Sensors (Basel). 2023 Jan 6;23(2):661. doi: 10.3390/s23020661.

Abstract

Sentiment classification is a key task in mining people's opinions; better sentiment classification helps individuals make better decisions. Social media users increasingly express their opinions and share their experiences with both images and text, rather than with text alone as in conventional social media. Understanding how to fully exploit both modalities is therefore critical for a variety of tasks, including sentiment classification. In this work, we propose a new multimodal sentiment classification approach: the visual distillation and attention network, or VisdaNet. First, the method introduces a knowledge augmentation module that overcomes the information sparsity of short texts by integrating image captions with the short text. Second, to address the information-control problem in multimodal fusion for product reviews, we propose CLIP-based knowledge distillation, which reduces noise in the original modalities and improves the quality of the original modal information. Finally, for the single-text, multi-image fusion problem in product reviews, we propose CLIP-based visual aspect attention, which correctly models the text-image interaction in this setting and realizes feature-level fusion across modalities. Experimental results on the Yelp multimodal dataset show that our model outperforms the previous state-of-the-art model, and ablation results demonstrate the efficacy of the individual strategies in the proposed model.
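The abstract describes the visual aspect attention module only at a high level: one review text is paired with several images, CLIP embeddings of the images are weighted by their similarity to the text, and the result is fused with the text feature. A minimal sketch of that idea, not the paper's exact implementation, might look like the following; the function name, the temperature value, and concatenation as the fusion operator are illustrative assumptions.

```python
# Minimal sketch (assumed design, not the authors' code) of CLIP-guided
# visual aspect attention for single-text / multi-image fusion.

import torch
import torch.nn.functional as F

def visual_aspect_attention(text_feat: torch.Tensor,
                            image_feats: torch.Tensor,
                            temperature: float = 0.07) -> torch.Tensor:
    """Fuse one text feature with K image features via attention.

    text_feat:   (d,)   CLIP text embedding of the review
    image_feats: (K, d) CLIP image embeddings of the K review photos
    Returns a (2*d,) fused feature (text || attended visual).
    """
    # L2-normalize so dot products are cosine similarities, as in CLIP.
    t = F.normalize(text_feat, dim=-1)
    v = F.normalize(image_feats, dim=-1)

    # Attention weights: softmax over text-image similarity scores.
    weights = F.softmax(v @ t / temperature, dim=0)   # (K,)

    # Attention-weighted visual summary of the K images.
    visual = (weights.unsqueeze(-1) * v).sum(dim=0)   # (d,)

    # Feature-level fusion; concatenation is one common choice.
    return torch.cat([t, visual], dim=-1)             # (2*d,)

# Example with random stand-in features (CLIP ViT-B/32 uses d = 512):
fused = visual_aspect_attention(torch.randn(512), torch.randn(4, 512))
print(fused.shape)  # torch.Size([1024])
```

The softmax weighting lets images that are semantically close to the review text dominate the fused representation, which matches the abstract's claim of modeling the text-image interaction rather than averaging the images uniformly.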


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c722/9862286/ad7974c40e77/sensors-23-00661-g001.jpg
