BIOSSES: a semantic sentence similarity estimation system for the biomedical domain.

Author information

Sogancioglu Gizem, Öztürk Hakime, Özgür Arzucan

Affiliations

Department of Computer Engineering, Bogazici University, Istanbul, Turkey.

R&D and Special Projects Department, Yapı Kredi Technology, Istanbul, Turkey.

Publication information

Bioinformatics. 2017 Jul 15;33(14):i49-i58. doi: 10.1093/bioinformatics/btx238.

DOI:10.1093/bioinformatics/btx238

PMID:28881973

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5870675/

Abstract

MOTIVATION

The amount of information available in textual format is rapidly increasing in the biomedical domain. Therefore, natural language processing (NLP) applications are becoming increasingly important to facilitate the retrieval and analysis of these data. Computing the semantic similarity between sentences is an important component in many NLP tasks including text retrieval and summarization. A number of approaches have been proposed for semantic sentence similarity estimation for generic English. However, our experiments showed that such approaches do not effectively cover biomedical knowledge and produce poor results for biomedical text.

METHODS

We propose several approaches for sentence-level semantic similarity computation in the biomedical domain, including string similarity measures and measures based on distributed vector representations of sentences learned in an unsupervised manner from a large biomedical corpus. In addition, ontology-based approaches are presented that utilize general and domain-specific ontologies. Finally, a supervised regression-based model is developed that effectively combines the different similarity computation metrics. A benchmark data set consisting of 100 sentence pairs from the biomedical literature is manually annotated by five human experts and used for evaluating the proposed methods.
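To make the two simplest families of measures concrete, here is a minimal Python sketch of a token-level string similarity (Jaccard overlap) and a vector-based similarity (cosine). It is not the authors' implementation: the function names, the whitespace tokenizer, and the use of raw word counts as vectors are all assumptions made for illustration. In particular, the count vectors merely stand in for the learned distributed sentence representations described above.

```python
import math
from collections import Counter


def tokenize(sentence):
    """Lowercase and split on whitespace (a deliberate simplification)."""
    return sentence.lower().split()


def jaccard_similarity(a, b):
    """String-level measure: shared tokens divided by all distinct tokens."""
    tokens_a, tokens_b = set(tokenize(a)), set(tokenize(b))
    if not tokens_a and not tokens_b:
        return 1.0  # two empty sentences are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def cosine_similarity(a, b):
    """Vector-level measure over word-count vectors.

    Assumption: word counts stand in for learned sentence embeddings.
    """
    counts_a, counts_b = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # no tokens, so no basis for comparison
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Hypothetical sentence pair, not taken from the benchmark data set.
    s1 = "the protein inhibits tumor growth"
    s2 = "the protein suppresses tumor growth"
    print(jaccard_similarity(s1, s2), cosine_similarity(s1, s2))
```

Scores from several such measures could then serve as input features to a regression model trained on the human-annotated pairs, which is the general idea behind the supervised combination mentioned above.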

RESULTS

The experiments showed that the supervised semantic sentence similarity computation approach obtained the best performance (0.836 correlation with gold standard human annotations) and improved over state-of-the-art domain-independent systems by up to 42.6% in terms of the Pearson correlation metric.
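For readers unfamiliar with the evaluation metric, the following sketch shows how a Pearson correlation between system scores and human annotations can be computed. The function name and the error handling are illustrative assumptions, not part of the BIOSSES code base.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and standard deviations, each left unnormalized by n;
    # the factors of n cancel in the final ratio.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if dev_x == 0 or dev_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (dev_x * dev_y)
```

A value near 1 means the system ranks and spaces sentence pairs much as the annotators did; a value near 0 means there is no linear relationship between the two sets of scores.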

AVAILABILITY AND IMPLEMENTATION

A web-based system for biomedical semantic sentence similarity computation, the source code, and the annotated benchmark data set are available at: http://tabilab.cmpe.boun.edu.tr/BIOSSES/ .

CONTACT

gizemsogancioglu@gmail.com or arzucan.ozgur@boun.edu.tr.
