
Universal Multimodal Representation for Language Understanding.

Publication Information

IEEE Trans Pattern Anal Mach Intell. 2023 Jul;45(7):9169-9185. doi: 10.1109/TPAMI.2023.3234170. Epub 2023 Jun 5.

DOI: 10.1109/TPAMI.2023.3234170
PMID: 37018264
Abstract

Representation learning is the foundation of natural language processing (NLP). This work presents new methods for employing visual information as auxiliary signals in general NLP tasks. For each sentence, we first retrieve a flexible number of images, either from a lightweight topic-image lookup table extracted over existing sentence-image pairs or from a shared cross-modal embedding space pre-trained on off-the-shelf text-image pairs. The text and images are then encoded by a Transformer encoder and a convolutional neural network, respectively. The two sequences of representations are fused by an attention layer so the two modalities can interact. In this design, the retrieval process is controllable and flexible, and the universal visual representation overcomes the lack of large-scale bilingual sentence-image pairs, so the method can be applied to text-only tasks without manually annotated multimodal parallel corpora. We apply the proposed method to a wide range of natural language generation and understanding tasks, including neural machine translation, natural language inference, and semantic similarity. Experimental results show that the method is generally effective across tasks and languages. Analysis indicates that the visual signals enrich the textual representations of content words, provide fine-grained grounding information about the relationships between concepts and events, and potentially aid disambiguation.
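The retrieve-then-fuse pipeline described in the abstract is concrete enough to sketch in code. Below is a minimal PyTorch illustration, assuming pre-computed CNN image features and a shared cross-modal embedding space; the module names, dimensions, and the top-k cosine retrieval are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of the retrieve-then-fuse pipeline from the abstract.
# Dimensions, names, and the top-k retrieval are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

def retrieve_images(sent_emb, image_embs, image_feats, k=5):
    """Pick the k images whose embeddings are closest to each sentence
    in a shared cross-modal space (cosine similarity)."""
    sims = F.cosine_similarity(sent_emb.unsqueeze(1),
                               image_embs.unsqueeze(0), dim=-1)
    topk = sims.topk(k, dim=-1).indices       # (batch, k)
    return image_feats[topk]                  # (batch, k, img_feat_dim)

class VisualFusionEncoder(nn.Module):
    """Transformer text encoder plus projected CNN image features,
    fused by a single cross-modal attention layer."""
    def __init__(self, d_model=512, n_heads=8, n_layers=6, img_feat_dim=2048):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.text_encoder = nn.TransformerEncoder(layer, n_layers)
        self.img_proj = nn.Linear(img_feat_dim, d_model)  # CNN features -> text space
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, token_embs, img_feats):
        text = self.text_encoder(token_embs)              # (batch, seq, d_model)
        imgs = self.img_proj(img_feats)                   # (batch, k, d_model)
        # Text tokens attend over the retrieved images.
        fused, _ = self.cross_attn(text, imgs, imgs)
        # Residual + norm keeps text-only behavior recoverable when the
        # retrieved images carry little signal.
        return self.norm(text + fused)

# Toy usage with random stand-ins for real embeddings and features.
enc = VisualFusionEncoder()
sent_emb    = torch.randn(2, 512)       # sentence embeddings in the shared space
image_embs  = torch.randn(1000, 512)    # candidate image embeddings (same space)
image_feats = torch.randn(1000, 2048)   # CNN features for the same candidates
imgs = retrieve_images(sent_emb, image_embs, image_feats)  # (2, 5, 2048)
tokens = torch.randn(2, 20, 512)        # token-level text representations
out = enc(tokens, imgs)                 # (2, 20, 512) visually grounded tokens
```

The paper's topic-image lookup table would replace the cosine retrieval with a direct topic-to-image mapping, but the fusion stage is the same either way.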


Similar Articles

1. Universal Multimodal Representation for Language Understanding.
IEEE Trans Pattern Anal Mach Intell. 2023 Jul;45(7):9169-9185. doi: 10.1109/TPAMI.2023.3234170. Epub 2023 Jun 5.

2. Predicting Semantic Similarity Between Clinical Sentence Pairs Using Transformer Models: Evaluation and Representational Analysis.
JMIR Med Inform. 2021 May 26;9(5):e23099. doi: 10.2196/23099.

3. Efficient Token-Guided Image-Text Retrieval With Consistent Multimodal Contrastive Training.
IEEE Trans Image Process. 2023;32:3622-3633. doi: 10.1109/TIP.2023.3286710. Epub 2023 Jul 3.

4. Fine-Grained Cross-Modal Semantic Consistency in Natural Conservation Image Data from a Multi-Task Perspective.
Sensors (Basel). 2024 May 14;24(10):3130. doi: 10.3390/s24103130.

5. The 2019 n2c2/OHNLP Track on Clinical Semantic Textual Similarity: Overview.
JMIR Med Inform. 2020 Nov 27;8(11):e23375. doi: 10.2196/23375.

6. SMAN: Stacked Multimodal Attention Network for Cross-Modal Image-Text Retrieval.
IEEE Trans Cybern. 2022 Feb;52(2):1086-1097. doi: 10.1109/TCYB.2020.2985716. Epub 2022 Feb 16.

7. Dependency-based Siamese long short-term memory network for learning sentence representations.
PLoS One. 2018 Mar 7;13(3):e0193919. doi: 10.1371/journal.pone.0193919. eCollection 2018.

8. Using Character-Level and Entity-Level Representations to Enhance Bidirectional Encoder Representation From Transformers-Based Clinical Semantic Textual Similarity Model: ClinicalSTS Modeling Study.
JMIR Med Inform. 2020 Dec 29;8(12):e23357. doi: 10.2196/23357.

9. Distributed representation and one-hot representation fusion with gated network for clinical semantic textual similarity.
BMC Med Inform Decis Mak. 2020 Apr 30;20(Suppl 1):72. doi: 10.1186/s12911-020-1045-z.

10. What Does a Language-And-Vision Transformer See: The Impact of Semantic Information on Visual Representations.
Front Artif Intell. 2021 Dec 3;4:767971. doi: 10.3389/frai.2021.767971. eCollection 2021.