Smalheiser Neil R, Cohen Aaron M, Bonifield Gary
Department of Psychiatry and Psychiatric Institute, University of Illinois College of Medicine, Chicago, IL 60612, USA.
Department of Medical Informatics and Clinical Epidemiology, Oregon Health & Science University, Portland, OR 97239, USA.
J Biomed Inform. 2019 Feb;90:103096. doi: 10.1016/j.jbi.2019.103096. Epub 2019 Jan 14.
Neural embeddings are a popular set of methods for representing words, phrases or text as a low-dimensional vector (typically 50-500 dimensions). However, it is difficult to interpret these dimensions in a meaningful manner, and creating neural embeddings requires extensive training and tuning of multiple parameters and hyperparameters. We present here a simple unsupervised method for representing words, phrases or text as a low-dimensional vector, in which the meaning and relative importance of dimensions is transparent to inspection. We have created a near-comprehensive vector representation of words, and selected bigrams, trigrams and abbreviations, using the set of titles and abstracts in PubMed as a corpus. This vector is used to create several novel implicit word-word and text-text similarity metrics. The implicit word-word similarity metrics correlate well with human judgement of word pair similarity and relatedness, and outperform or equal all other reported methods on a variety of biomedical benchmarks, including several implementations of neural embeddings trained on PubMed corpora. Our implicit word-word metrics capture different aspects of word-word relatedness than word2vec-based metrics and are only partially correlated with them (rho = 0.5-0.8 depending on task and corpus). The vector representations of words, bigrams, trigrams, abbreviations, and PubMed titles + abstracts are all publicly available from http://arrowsmith.psych.uic.edu/arrowsmith_uic/word_similarity_metrics.html for release under a CC-BY-NC license. Several public web query interfaces are also available at the same site, including one which allows the user to specify a given word and view its most closely related terms according to direct co-occurrence as well as different implicit similarity metrics.
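The abstract does not give the exact formula for the implicit similarity metrics. A minimal sketch of the general idea, assuming (as one plausible reading) that each word is represented by its co-occurrence profile over a vocabulary, so that every dimension is a nameable word, and that "implicit" similarity compares those profiles rather than direct co-occurrence; the toy corpus and function names here are hypothetical, not the authors' implementation:

```python
from collections import Counter
from math import sqrt

# Toy stand-in for PubMed titles/abstracts (hypothetical data).
docs = [
    "aspirin reduces inflammation and pain",
    "ibuprofen reduces inflammation and fever",
    "insulin regulates blood glucose",
]

def cooccurrence_vector(word, docs):
    """Count words co-occurring with `word` in the same document.
    Each dimension corresponds to a vocabulary word, so the vector
    is transparent to inspection, unlike a neural embedding."""
    counts = Counter()
    for doc in docs:
        tokens = doc.split()
        if word in tokens:
            counts.update(t for t in tokens if t != word)
    return counts

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = (sqrt(sum(x * x for x in u.values()))
            * sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

# Implicit similarity: "aspirin" and "ibuprofen" never co-occur
# directly in this corpus, yet they share co-occurrence partners
# ("reduces", "inflammation"), so their profiles are similar.
sim = cosine(cooccurrence_vector("aspirin", docs),
             cooccurrence_vector("ibuprofen", docs))
print(sim)  # 0.75
```

This illustrates why implicit metrics can capture relatedness that direct co-occurrence misses, which is what the site's query interface lets users compare side by side.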