Jointly learning word embeddings using a corpus and a knowledge base.

Affiliations

Department of Computer Science, University of Liverpool, Liverpool, United Kingdom.

Kawarabayashi ERATO Large Graph Project, Tokyo, Japan.

Publication Information

PLoS One. 2018 Mar 12;13(3):e0193094. doi: 10.1371/journal.pone.0193094. eCollection 2018.

DOI: 10.1371/journal.pone.0193094
PMID: 29529052
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5847320/
Abstract

Methods for representing the meaning of words in vector spaces purely using the information distributed in text corpora have proved to be very valuable in various text mining and natural language processing (NLP) tasks. However, these methods still disregard the valuable semantic relational structure between words in co-occurring contexts. These beneficial semantic relational structures are contained in manually created knowledge bases (KBs) such as ontologies and semantic lexicons, where the meanings of words are represented by defining the various relationships that exist among those words. We combine the knowledge in both a corpus and a KB to learn better word embeddings. Specifically, we propose a joint word representation learning method that uses the knowledge in the KB and simultaneously predicts the co-occurrences of two words in a corpus context. In particular, we use the corpus to define our objective function, subject to the relational constraints derived from the KB. We further utilise the corpus co-occurrence statistics to propose two novel approaches, Nearest Neighbour Expansion (NNE) and Hedged Nearest Neighbour Expansion (HNE), that dynamically expand the KB and thereby derive more constraints to guide the optimisation process. Our experimental results over a wide range of benchmark tasks demonstrate that the proposed method statistically significantly improves the accuracy of the word embeddings learnt. It outperforms a corpus-only baseline and improves on a number of previously proposed methods that incorporate corpora and KBs, in both semantic similarity prediction and word analogy detection tasks.
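
The abstract sketches two ideas that a small numeric example can make concrete: a corpus-driven objective constrained by KB relations, and nearest-neighbour expansion of the KB from co-occurrence statistics. Below is a minimal Python sketch of that general shape, not the authors' released implementation: the GloVe-style weighting, the penalty weight `lam`, and the cosine threshold in `nne_expand` are all illustrative assumptions.

```python
# Minimal sketch of a joint corpus + KB embedding objective and a
# nearest-neighbour KB expansion, per the abstract's description.
# Hyperparameters and the exact functional forms are assumptions.
import numpy as np

rng = np.random.default_rng(0)

V, d = 6, 4                                  # toy vocabulary size, embedding dim
C = rng.poisson(2.0, (V, V)).astype(float)   # toy co-occurrence counts c_ij
kb_pairs = [(0, 1), (2, 3)]                  # word pairs related in the KB
lam = 0.1                                    # KB-constraint weight (assumed)

W = rng.normal(scale=0.1, size=(V, d))       # target word embeddings
Wt = rng.normal(scale=0.1, size=(V, d))      # context word embeddings
b = np.zeros(V)                              # target biases
bt = np.zeros(V)                             # context biases

def glove_weight(c, c_max=10.0, alpha=0.75):
    """GloVe-style weighting that caps the influence of frequent pairs."""
    return min((c / c_max) ** alpha, 1.0)

def joint_loss():
    """Corpus term (predict log co-occurrences from dot products) plus a
    KB term that pulls embeddings of KB-related words together."""
    loss = 0.0
    for i in range(V):
        for j in range(V):
            if C[i, j] > 0:
                err = W[i] @ Wt[j] + b[i] + bt[j] - np.log(C[i, j])
                loss += glove_weight(C[i, j]) * err ** 2
    for i, j in kb_pairs:
        loss += lam * np.sum((W[i] - W[j]) ** 2)
    return loss

def nne_expand(pairs, counts, sim_threshold=0.5):
    """Illustrative Nearest Neighbour Expansion: treat each word's
    co-occurrence row as its distributional profile and, when a word's
    nearest corpus neighbour is similar enough (cosine), copy that word's
    KB constraints onto the neighbour. The criterion and threshold are
    assumptions, not the paper's exact rule."""
    norms = counts / (np.linalg.norm(counts, axis=1, keepdims=True) + 1e-12)
    sim = norms @ norms.T
    np.fill_diagonal(sim, -1.0)              # exclude self-similarity
    nearest = sim.argmax(axis=1)
    expanded = set(pairs)
    for i, j in pairs:
        if sim[i, nearest[i]] >= sim_threshold:
            expanded.add((int(nearest[i]), j))
    return sorted(expanded)

print("joint loss:", joint_loss())
print("expanded KB pairs:", nne_expand(kb_pairs, C))
```

In this shape the KB acts as a soft regulariser on the corpus objective rather than a hard constraint; HNE refines the expansion step more conservatively than NNE, and the paper gives the exact criteria for both.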

Figures (PMC):
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b710/5847320/bcd684d77dec/pone.0193094.g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b710/5847320/86b639a84ff0/pone.0193094.g005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b710/5847320/0c78db7021d1/pone.0193094.g007.jpg

Similar Articles

1. Jointly learning word embeddings using a corpus and a knowledge base.
   PLoS One. 2018 Mar 12;13(3):e0193094. doi: 10.1371/journal.pone.0193094. eCollection 2018.
2. Evaluating semantic relations in neural word embeddings with biomedical and general domain knowledge bases.
   BMC Med Inform Decis Mak. 2018 Jul 23;18(Suppl 2):65. doi: 10.1186/s12911-018-0630-x.
3. Fine-Tuning Word Embeddings for Hierarchical Representation of Data Using a Corpus and a Knowledge Base for Various Machine Learning Applications.
   Comput Math Methods Med. 2021 Nov 16;2021:9761163. doi: 10.1155/2021/9761163. eCollection 2021.
4. A comparison of word embeddings for the biomedical natural language processing.
   J Biomed Inform. 2018 Nov;87:12-20. doi: 10.1016/j.jbi.2018.09.008. Epub 2018 Sep 12.
5. Knowledge based word-concept model estimation and refinement for biomedical text mining.
   J Biomed Inform. 2015 Feb;53:300-7. doi: 10.1016/j.jbi.2014.11.015. Epub 2014 Dec 12.
6. Incorporating linguistic knowledge for learning distributed word representations.
   PLoS One. 2015 Apr 13;10(4):e0118437. doi: 10.1371/journal.pone.0118437. eCollection 2015.
7. An Unsupervised Graph Based Continuous Word Representation Method for Biomedical Text Mining.
   IEEE/ACM Trans Comput Biol Bioinform. 2016 Jul-Aug;13(4):634-42. doi: 10.1109/TCBB.2015.2478467. Epub 2015 Sep 14.
8. Multi-Ontology Refined Embeddings (MORE): A hybrid multi-ontology and corpus-based semantic representation model for biomedical concepts.
   J Biomed Inform. 2020 Nov;111:103581. doi: 10.1016/j.jbi.2020.103581. Epub 2020 Oct 1.
9. Unsupervised low-dimensional vector representations for words, phrases and text that are transparent, scalable, and produce similarity metrics that are not redundant with neural embeddings.
   J Biomed Inform. 2019 Feb;90:103096. doi: 10.1016/j.jbi.2019.103096. Epub 2019 Jan 14.
10. Biomedical Text Classification Using Augmented Word Representation Based on Distributional and Relational Contexts.
   Comput Intell Neurosci. 2023 Feb 15;2023:2989791. doi: 10.1155/2023/2989791. eCollection 2023.

Cited By

1. The geometry of meaning: evaluating sentence embeddings from diverse transformer-based models for natural language inference.
   PeerJ Comput Sci. 2025 Jun 16;11:e2957. doi: 10.7717/peerj-cs.2957. eCollection 2025.
2. Biomedical Text Classification Using Augmented Word Representation Based on Distributional and Relational Contexts.
   Comput Intell Neurosci. 2023 Feb 15;2023:2989791. doi: 10.1155/2023/2989791. eCollection 2023.
3. Multi-Ontology Refined Embeddings (MORE): A hybrid multi-ontology and corpus-based semantic representation model for biomedical concepts.
   J Biomed Inform. 2020 Nov;111:103581. doi: 10.1016/j.jbi.2020.103581. Epub 2020 Oct 1.

References

1. Extracting microRNA-gene relations from biomedical literature using distant supervision.
   PLoS One. 2017 Mar 6;12(3):e0171929. doi: 10.1371/journal.pone.0171929. eCollection 2017.
2. The Unified Medical Language System (UMLS): integrating biomedical terminology.
   Nucleic Acids Res. 2004 Jan 1;32(Database issue):D267-70. doi: 10.1093/nar/gkh061.
4. Exploring semantic deep learning for building reliable and reusable one health knowledge from PubMed systematic reviews and veterinary clinical notes.
   J Biomed Semantics. 2019 Nov 12;10(Suppl 1):22. doi: 10.1186/s13326-019-0212-6.
5. A Year of Papers Using Biomedical Texts: Findings from the Section on Natural Language Processing of the IMIA Yearbook.
   Yearb Med Inform. 2019 Aug;28(1):218-222. doi: 10.1055/s-0039-1677937. Epub 2019 Aug 16.