Vector representations of multi-word terms for semantic relatedness.

Affiliations

Department of Computer Science, Virginia Commonwealth University, 401 S. Main St., Richmond, VA 23284, USA.

Publication information

J Biomed Inform. 2018 Jan;77:111-119. doi: 10.1016/j.jbi.2017.12.006. Epub 2017 Dec 13.

Abstract

This paper compares several methods for aggregating distributional context vectors into multi-word term representations, applied to the task of measuring semantic similarity and relatedness in the biomedical domain. We compare four aggregation methods: summation of component word vectors, mean of component word vectors, direct construction of compound term vectors using the compoundify tool, and direct construction of concept vectors using the MetaMap tool. Dimensionality reduction is critical when constructing high-quality distributional context vectors, so these baseline co-occurrence vectors are compared against dimensionality-reduced vectors created using singular value decomposition (SVD) and against word2vec word embeddings created with the continuous bag of words (CBOW) and skip-gram models. We also find optimal vector dimensionalities for the vectors produced by these techniques. Our results show that none of the tested multi-word term aggregation methods is statistically significantly better than any other. This allows flexibility when choosing a multi-word term aggregation method and means that expensive corpus preprocessing may be avoided. Results are shown on several standard evaluation datasets, and state-of-the-art results are achieved.
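
The two simplest aggregation methods named in the abstract, summation and mean of component word vectors, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy vectors and their values are invented, the function names `aggregate`, `cosine`, and `relatedness` are hypothetical, and cosine similarity is assumed as the relatedness measure, which the abstract does not specify. It is not the authors' implementation.

```python
import math

# Toy co-occurrence vectors. The words and counts are invented for
# illustration only and do not come from the paper.
WORD_VECTORS = {
    "heart": [4.0, 1.0, 0.0, 2.0],
    "attack": [1.0, 3.0, 1.0, 0.0],
    "myocardial": [3.0, 0.0, 1.0, 2.0],
    "infarction": [2.0, 3.0, 0.0, 1.0],
}


def aggregate(term, vectors, method="sum"):
    """Build a vector for a multi-word term from its component word vectors."""
    components = [vectors[word] for word in term.split()]
    # Element-wise sum across all component word vectors.
    summed = [sum(dims) for dims in zip(*components)]
    if method == "sum":
        return summed
    if method == "mean":
        return [value / len(components) for value in summed]
    raise ValueError(f"unknown aggregation method: {method}")


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def relatedness(term_a, term_b, vectors, method="sum"):
    """Score two multi-word terms by the cosine of their aggregated vectors."""
    return cosine(
        aggregate(term_a, vectors, method),
        aggregate(term_b, vectors, method),
    )


if __name__ == "__main__":
    for method in ("sum", "mean"):
        score = relatedness(
            "heart attack", "myocardial infarction", WORD_VECTORS, method
        )
        print(method, round(score, 4))
```

Under the cosine assumption, the mean vector is just the sum vector scaled by a positive constant, so both methods give the same score in this sketch. A different relatedness measure, or vectors that are normalized before aggregation, could behave differently.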
