

Unsupervised low-dimensional vector representations for words, phrases and text that are transparent, scalable, and produce similarity metrics that are not redundant with neural embeddings.

Author Information

Smalheiser Neil R, Cohen Aaron M, Bonifield Gary

Affiliations

Department of Psychiatry and Psychiatric Institute, University of Illinois College of Medicine, Chicago, IL 60612, USA.

Department of Medical Informatics and Clinical Epidemiology, Oregon Health & Science University, Portland, OR 97239, USA.

Publication Information

J Biomed Inform. 2019 Feb;90:103096. doi: 10.1016/j.jbi.2019.103096. Epub 2019 Jan 14.

Abstract

Neural embeddings are a popular set of methods for representing words, phrases or text as a low dimensional vector (typically 50-500 dimensions). However, it is difficult to interpret these dimensions in a meaningful manner, and creating neural embeddings requires extensive training and tuning of multiple parameters and hyperparameters. We present here a simple unsupervised method for representing words, phrases or text as a low dimensional vector, in which the meaning and relative importance of dimensions is transparent to inspection. We have created a near-comprehensive vector representation of words, and selected bigrams, trigrams and abbreviations, using the set of titles and abstracts in PubMed as a corpus. This vector is used to create several novel implicit word-word and text-text similarity metrics. The implicit word-word similarity metrics correlate well with human judgement of word pair similarity and relatedness, and outperform or equal all other reported methods on a variety of biomedical benchmarks, including several implementations of neural embeddings trained on PubMed corpora. Our implicit word-word metrics capture different aspects of word-word relatedness than word2vec-based metrics and are only partially correlated (rho = 0.5-0.8 depending on task and corpus). The vector representations of words, bigrams, trigrams, abbreviations, and PubMed title + abstracts are all publicly available from http://arrowsmith.psych.uic.edu/arrowsmith_uic/word_similarity_metrics.html for release under CC-BY-NC license. Several public web query interfaces are also available at the same site, including one which allows the user to specify a given word and view its most closely related terms according to direct co-occurrence as well as different implicit similarity metrics.
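The abstract does not spell out how the implicit similarity metrics are constructed, but the evaluation workflow it describes — scoring word pairs with a vector-based similarity measure and comparing those scores against human judgments of similarity and relatedness via Spearman's rho — can be sketched as below. This is a minimal illustrative sketch under stated assumptions, not the authors' implementation: the cosine similarity choice, the vector dictionary, and the benchmark pair format are all assumptions for illustration.

```python
# Minimal sketch (assumed workflow, not the authors' code): score word pairs
# with a vector similarity function and correlate the scores with human
# similarity/relatedness judgments using Spearman's rho.
import numpy as np
from scipy.stats import spearmanr


def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity; returns 0.0 when either vector has zero norm."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0


def benchmark_rho(vectors: dict, pairs: list) -> float:
    """vectors: term -> low-dimensional np.ndarray representation.
    pairs: (word1, word2, human_score) tuples from a word-pair
    similarity/relatedness benchmark. Returns Spearman's rho over the
    pairs whose words are both covered by the vector vocabulary."""
    model_scores, human_scores = [], []
    for w1, w2, human in pairs:
        if w1 in vectors and w2 in vectors:
            model_scores.append(cosine(vectors[w1], vectors[w2]))
            human_scores.append(human)
    rho, _ = spearmanr(model_scores, human_scores)
    return rho
```

The same spearmanr call, applied to two sets of model scores rather than model-versus-human scores, is one way to quantify the partial overlap the abstract reports between the implicit metrics and word2vec-based metrics (rho = 0.5-0.8 depending on task and corpus).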


Similar Articles

Corpus domain effects on distributional semantic modeling of medical terms.
Bioinformatics. 2016 Dec 1;32(23):3635-3644. doi: 10.1093/bioinformatics/btw529. Epub 2016 Aug 16.

Jointly learning word embeddings using a corpus and a knowledge base.
PLoS One. 2018 Mar 12;13(3):e0193094. doi: 10.1371/journal.pone.0193094. eCollection 2018.

Vector representations of multi-word terms for semantic relatedness.
J Biomed Inform. 2018 Jan;77:111-119. doi: 10.1016/j.jbi.2017.12.006. Epub 2017 Dec 13.

