Rybinski Maciej, Aldana-Montes José Francisco
Departamento LCC, University of Malaga, Campus Teatinos, Malaga, 29010, Spain.
J Biomed Semantics. 2016 Dec 28;7(1):67. doi: 10.1186/s13326-016-0109-6.
Semantic relatedness is a measure that quantifies the strength of a semantic link between two concepts. Often, it can be efficiently approximated with methods that operate on the words that represent these concepts. Approximating semantic relatedness between texts, and between the concepts these texts represent, is an important part of many text and knowledge processing tasks in the ever-growing domain of biomedical informatics. The problem with most state-of-the-art methods for calculating semantic relatedness is their dependence on highly specialized, structured knowledge resources, which makes them poorly adaptable to many usage scenarios. On the other hand, domain knowledge in the Life Sciences has become more and more accessible, but mostly in unstructured form, as texts in large document collections, which makes it more challenging to use in automated processing. In this paper we present tESA, an extension to the well-known Explicit Semantic Analysis (ESA) method.
In our extension we use two separate sets of vectors, corresponding to different sections of the articles from the underlying corpus of documents, as opposed to the original method, which only uses a single vector space. We present an evaluation of the applicability of both tESA and domain-adapted Explicit Semantic Analysis to the Life Sciences domain. The methods are tested against a set of standard benchmarks established for the evaluation of biomedical semantic relatedness quality. Our experiments show that the proposed method achieves results comparable with or superior to the current state-of-the-art methods. Additionally, a comparative discussion of the results obtained with tESA and ESA is presented, together with a study of the adaptability of the methods to different corpora and their performance with different input parameters.
Our findings suggest that the combined use of semantics from different sections of the documents in scientific corpora (i.e. extending the original ESA methodology with title vectors) may enhance the performance of distributional semantic relatedness measures, an effect that can be observed on the largest reference datasets. We also present the impact of the proposed extension on the size of distributional representations.
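The idea of using two sets of vectors, one per document section, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the weighting scheme or how the two vector sets are combined, so the tf-idf weighting, the `top_n` cutoff, the score-weighted sum of title vectors, the `ToyIndex` class, and the three-document toy corpus are all assumptions made purely for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Naive whitespace tokenizer; a real system would normalize far more.
    return text.lower().split()


def tfidf_vectors(texts):
    """Return one sparse {term: weight} vector per text (smoothed tf-idf)."""
    counts = [Counter(tokenize(t)) for t in texts]
    df = Counter(term for c in counts for term in c)
    n = len(texts)
    return [
        {term: tf * (math.log((1 + n) / (1 + df[term])) + 1.0)
         for term, tf in c.items()}
        for c in counts
    ]


def cosine(u, v):
    """Cosine similarity between two sparse dict vectors."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


class ToyIndex:
    """Hypothetical index holding two vector sets: body and title."""

    def __init__(self, docs):
        self.body = tfidf_vectors([d["body"] for d in docs])
        self.title = tfidf_vectors([d["title"] for d in docs])
        # Inverted index over body text: term -> {doc_id: weight}.
        self.inverted = defaultdict(dict)
        for doc_id, vec in enumerate(self.body):
            for term, weight in vec.items():
                self.inverted[term][doc_id] = weight

    def esa(self, text):
        """ESA-style vector: one dimension per document in the corpus."""
        vec = defaultdict(float)
        for term, tf in Counter(tokenize(text)).items():
            for doc_id, weight in self.inverted.get(term, {}).items():
                vec[doc_id] += tf * weight
        return dict(vec)

    def tesa(self, text, top_n=2):
        """Assumed title-based variant: score-weighted sum of the title
        vectors of the top_n documents in the ESA vector."""
        ranked = sorted(self.esa(text).items(),
                        key=lambda kv: kv[1], reverse=True)[:top_n]
        vec = defaultdict(float)
        for doc_id, score in ranked:
            for term, weight in self.title[doc_id].items():
                vec[term] += score * weight
        return dict(vec)


# Made-up toy corpus, for illustration only.
docs = [
    {"title": "renal failure treatment",
     "body": "kidney dialysis patients renal failure"},
    {"title": "renal disease outcomes",
     "body": "nephropathy kidney damage chronic"},
    {"title": "cardiac arrest outcomes",
     "body": "heart attack myocardial infarction"},
]
index = ToyIndex(docs)

esa_sim = cosine(index.esa("dialysis"), index.esa("nephropathy"))
tesa_sim = cosine(index.tesa("dialysis"), index.tesa("nephropathy"))
print("ESA-style similarity:", esa_sim)
print("title-based similarity:", tesa_sim)
```

In this toy corpus the two ESA-style vectors share no document, so their cosine is zero, while the title-based vectors overlap on the term "renal" and give a positive score. The sketch only shows the mechanics of keeping two vector sets; it says nothing about performance on the benchmarks discussed above.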