Multimodal detection of hateful memes by applying a vision-language pre-training model.

Affiliations

Putnam Science Academy, Putnam, CT, United States of America.

Department of Radiology, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China.

Publication information

PLoS One. 2022 Sep 12;17(9):e0274300. doi: 10.1371/journal.pone.0274300. eCollection 2022.

Abstract

Online hateful messages, detrimental to individuals and society alike, have recently become a major social issue. Among them, a new type of hateful message, the "hateful meme", has emerged and poses difficulties for traditional deep-learning-based detection. Because a hateful meme combines a text caption with an image to express its author's intent, it cannot be accurately identified by analyzing the embedded caption or the image alone; effective detection requires strong vision-language fusion capability. In this study, we move closer to this goal by stacking each meme's visual features and object tags, produced by the VinVL (Visual features in Vision-Language) object detection model, with its text features, extracted by optical character recognition (OCR), into a triplet that is fed to the Transformer-based vision-language pre-training model (VL-PTM) OSCAR+ for cross-modal learning. After fine-tuning and attaching a random forest (RF) classifier, our model (OSCAR+RF) achieved an average accuracy of 0.684 and an AUROC of 0.768 on the hateful meme detection task on a public test set, exceeding eleven published baselines. In conclusion, this study demonstrates that a VL-PTM with added anchor points can improve deep-learning-based detection of hateful memes by enforcing a stronger alignment between the text caption and the visual information.
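
As a concrete illustration of the classification head, here is a minimal, runnable Python sketch of the OSCAR+RF design: a random forest trained on pooled cross-modal embeddings. The embed_memes() stub is a hypothetical stand-in for the fine-tuned OSCAR+ encoder (the real VinVL/OCR preprocessing and model checkpoint are not shown); only the NumPy and scikit-learn calls are real API.

    # Sketch of the OSCAR+RF head: a random forest on top of
    # pooled cross-modal meme embeddings.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score, roc_auc_score

    rng = np.random.default_rng(0)

    def embed_memes(n_memes, dim=768):
        # Hypothetical stand-in for the fine-tuned OSCAR+ encoder: the
        # real pipeline would encode each meme's (visual features,
        # object tags, OCR text) triplet and return pooled embeddings.
        return rng.normal(size=(n_memes, dim))

    # Hypothetical labeled split: 1 = hateful, 0 = benign.
    X_train, y_train = embed_memes(800), rng.integers(0, 2, size=800)
    X_test, y_test = embed_memes(200), rng.integers(0, 2, size=200)

    # Random forest classifier attached to the frozen embeddings.
    clf = RandomForestClassifier(n_estimators=500, random_state=0)
    clf.fit(X_train, y_train)

    scores = clf.predict_proba(X_test)[:, 1]
    print("accuracy:", accuracy_score(y_test, (scores > 0.5).astype(int)))
    print("AUROC:", roc_auc_score(y_test, scores))

On random embeddings the metrics are of course chance-level; the sketch only shows the shape of the head, in which the transformer's pooled output, frozen after fine-tuning, is handed to a conventional RF rather than a softmax layer.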


Fig 1. https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7d18/9467312/d18b2ba13dd3/pone.0274300.g001.jpg
