
Neural sentence embedding models for semantic similarity estimation in the biomedical domain.

Affiliation

Section for Artificial Intelligence and Decision Support, Center for Medical Statistics, Informatics, and Intelligent Systems, Medical University of Vienna, Währinger Straße 25a, 1090, Vienna, Austria.

Publication information

BMC Bioinformatics. 2019 Apr 11;20(1):178. doi: 10.1186/s12859-019-2789-2.

Abstract

BACKGROUND

Neural network-based embedding models are receiving significant attention in the field of natural language processing because they can capture semantic information effectively, representing words, sentences, or even larger text elements in a low-dimensional vector space. While current state-of-the-art models for assessing the semantic similarity of textual statements from biomedical publications depend on the availability of laboriously curated ontologies, unsupervised neural embedding models require only large text corpora as input and do not need manual curation. In this study, we investigated the efficacy of current state-of-the-art neural sentence embedding models for semantic similarity estimation of sentences from biomedical literature. We trained different neural embedding models on 1.7 million articles from the PubMed Open Access dataset and evaluated them on a biomedical benchmark set containing 100 sentence pairs annotated by human experts, as well as on a smaller contradiction subset derived from the original benchmark set.
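As a rough illustration of how sentence embeddings can yield a similarity score, the sketch below compares vectors by cosine similarity. The abstract does not state which similarity measure was used, so cosine similarity is an assumption here, and the three-dimensional vectors and the name `cosine_similarity` are invented for the example; they are not output of the trained models.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Convention chosen for this sketch: a zero vector is similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy "sentence vectors" (real embeddings have far more dimensions).
vec_a = [0.2, 0.7, -0.1]
vec_b = [0.25, 0.6, 0.0]   # points in nearly the same direction as vec_a
vec_c = [-0.5, 0.1, 0.9]   # points in a different direction

sim_ab = cosine_similarity(vec_a, vec_b)
sim_ac = cosine_similarity(vec_a, vec_c)
print(sim_ab > sim_ac)
```

In a setup like this, each sentence of a benchmark pair would be mapped to a vector by the embedding model, and the resulting score would then be compared with the human annotation.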

RESULTS

Experimental results showed that, with a Pearson correlation of 0.819, our best unsupervised model based on the Paragraph Vector Distributed Memory algorithm outperforms previous state-of-the-art results achieved on the BIOSSES biomedical benchmark set. Moreover, our proposed supervised model that combines different string-based similarity metrics with a neural embedding model surpasses previous ontology-dependent supervised state-of-the-art approaches in terms of Pearson's r (r = 0.871) on the biomedical benchmark set. In contrast to the promising results for the original benchmark, we found our best models' performance on the smaller contradiction subset to be poor.
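To make the reported metric concrete, the following minimal sketch computes Pearson's r between model similarity scores and human ratings. The function name `pearson_r` and both score lists are made up for illustration and are not taken from the benchmark.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical scores for five sentence pairs (illustrative values only).
model_scores = [0.91, 0.15, 0.62, 0.40, 0.78]   # e.g. model similarity scores
human_scores = [3.8, 0.5, 2.9, 1.2, 3.1]        # e.g. averaged expert ratings

r = pearson_r(model_scores, human_scores)
print(round(r, 3))
```

Pearson's r is invariant to linear rescaling, so the model scores and the human ratings do not need to share a scale.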

CONCLUSIONS

In this study, we have highlighted the value of neural network-based models for semantic similarity estimation in the biomedical domain by showing that they can keep up with and even surpass previous state-of-the-art approaches for semantic similarity estimation that depend on the availability of laboriously curated ontologies, when evaluated on a biomedical benchmark set. Capturing contradictions and negations in biomedical sentences, however, emerged as an essential area for further work.
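One way to see why negation is difficult for surface-level measures is a token-overlap score. The sketch below uses Jaccard similarity over word sets as a stand-in for a string-based metric. The abstract does not name the string-based metrics that were used, so this choice is an assumption, and both sentences are invented examples rather than benchmark items.

```python
def jaccard_similarity(sentence_a, sentence_b):
    """Jaccard overlap of the lowercase, whitespace-separated token sets."""
    tokens_a = set(sentence_a.lower().split())
    tokens_b = set(sentence_b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty sentences are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# Hypothetical sentence pair that differs only by a negation.
statement = "the mutation is associated with tumor growth"
negation = "the mutation is not associated with tumor growth"

score = jaccard_similarity(statement, negation)
print(score)  # 7 shared tokens out of 8 distinct tokens -> 0.875
```

The two sentences make opposite claims, yet they share seven of their eight distinct tokens, so an overlap-based score rates them as highly similar.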

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e295/6460644/2ff2ff8aac24/12859_2019_2789_Fig1_HTML.jpg
