

Unsupervised low-dimensional vector representations for words, phrases and text that are transparent, scalable, and produce similarity metrics that are not redundant with neural embeddings.

Author Information

Smalheiser Neil R, Cohen Aaron M, Bonifield Gary

Affiliations

Department of Psychiatry and Psychiatric Institute, University of Illinois College of Medicine, Chicago, IL 60612, USA.

Department of Medical Informatics and Clinical Epidemiology, Oregon Health & Science University, Portland, OR 97239, USA.

Publication Information

J Biomed Inform. 2019 Feb;90:103096. doi: 10.1016/j.jbi.2019.103096. Epub 2019 Jan 14.

Abstract

Neural embeddings are a popular set of methods for representing words, phrases or text as a low dimensional vector (typically 50-500 dimensions). However, it is difficult to interpret these dimensions in a meaningful manner, and creating neural embeddings requires extensive training and tuning of multiple parameters and hyperparameters. We present here a simple unsupervised method for representing words, phrases or text as a low dimensional vector, in which the meaning and relative importance of dimensions is transparent to inspection. We have created a near-comprehensive vector representation of words, and selected bigrams, trigrams and abbreviations, using the set of titles and abstracts in PubMed as a corpus. This vector is used to create several novel implicit word-word and text-text similarity metrics. The implicit word-word similarity metrics correlate well with human judgement of word pair similarity and relatedness, and outperform or equal all other reported methods on a variety of biomedical benchmarks, including several implementations of neural embeddings trained on PubMed corpora. Our implicit word-word metrics capture different aspects of word-word relatedness than word2vec-based metrics and are only partially correlated (rho = 0.5-0.8 depending on task and corpus). The vector representations of words, bigrams, trigrams, abbreviations, and PubMed title + abstracts are all publicly available from http://arrowsmith.psych.uic.edu/arrowsmith_uic/word_similarity_metrics.html for release under CC-BY-NC license. Several public web query interfaces are also available at the same site, including one which allows the user to specify a given word and view its most closely related terms according to direct co-occurrence as well as different implicit similarity metrics.
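The abstract does not spell out how the implicit similarity metrics are constructed, but the evaluation workflow it describes — scoring word pairs with a vector-based similarity measure and comparing those scores against human judgments of similarity and relatedness via Spearman's rho — can be sketched as below. This is a minimal illustrative sketch under stated assumptions, not the authors' implementation: the cosine similarity choice, the vector dictionary, and the benchmark pair format are all assumptions for illustration.

```python
# Minimal sketch (assumed workflow, not the authors' code): score word pairs
# with a vector similarity function and correlate the scores with human
# similarity/relatedness judgments using Spearman's rho.
import numpy as np
from scipy.stats import spearmanr


def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity; returns 0.0 when either vector has zero norm."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0


def benchmark_rho(vectors: dict, pairs: list) -> float:
    """vectors: term -> low-dimensional np.ndarray representation.
    pairs: (word1, word2, human_score) tuples from a word-pair
    similarity/relatedness benchmark. Returns Spearman's rho over the
    pairs whose words are both covered by the vector vocabulary."""
    model_scores, human_scores = [], []
    for w1, w2, human in pairs:
        if w1 in vectors and w2 in vectors:
            model_scores.append(cosine(vectors[w1], vectors[w2]))
            human_scores.append(human)
    rho, _ = spearmanr(model_scores, human_scores)
    return rho
```

The same spearmanr call, applied to two sets of model scores rather than model-versus-human scores, is one way to quantify the partial overlap the abstract reports between the implicit metrics and word2vec-based metrics (rho = 0.5-0.8 depending on task and corpus).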


Similar Articles

Corpus domain effects on distributional semantic modeling of medical terms.
Bioinformatics. 2016 Dec 1;32(23):3635-3644. doi: 10.1093/bioinformatics/btw529. Epub 2016 Aug 16.

Jointly learning word embeddings using a corpus and a knowledge base.
PLoS One. 2018 Mar 12;13(3):e0193094. doi: 10.1371/journal.pone.0193094. eCollection 2018.

Vector representations of multi-word terms for semantic relatedness.
J Biomed Inform. 2018 Jan;77:111-119. doi: 10.1016/j.jbi.2017.12.006. Epub 2017 Dec 13.

