
Neural sentence embedding models for semantic similarity estimation in the biomedical domain.

Affiliation

Section for Artificial Intelligence and Decision Support, Center for Medical Statistics, Informatics, and Intelligent Systems, Medical University of Vienna, Währinger Straße 25a, 1090, Vienna, Austria.

Publication information

BMC Bioinformatics. 2019 Apr 11;20(1):178. doi: 10.1186/s12859-019-2789-2.

DOI: 10.1186/s12859-019-2789-2
PMID: 30975071
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC6460644/

Abstract

BACKGROUND

Neural network based embedding models are receiving significant attention in the field of natural language processing due to their capability to effectively capture semantic information representing words, sentences or even larger text elements in low-dimensional vector space. While current state-of-the-art models for assessing the semantic similarity of textual statements from biomedical publications depend on the availability of laboriously curated ontologies, unsupervised neural embedding models only require large text corpora as input and do not need manual curation. In this study, we investigated the efficacy of current state-of-the-art neural sentence embedding models for semantic similarity estimation of sentences from biomedical literature. We trained different neural embedding models on 1.7 million articles from the PubMed Open Access dataset, and evaluated them based on a biomedical benchmark set containing 100 sentence pairs annotated by human experts and a smaller contradiction subset derived from the original benchmark set.
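To make the idea of comparing sentences in a vector space concrete, the following is a minimal sketch, not the authors' method. It assumes hypothetical 3-dimensional word vectors (`TOY_VECTORS`) and uses a plain average of word vectors as a crude stand-in for a trained sentence embedding such as Paragraph Vector. The function names are illustrative only.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def mean_vector(tokens, word_vectors):
    """Average the vectors of known tokens (stand-in for a sentence embedding)."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        raise ValueError("no token has a known vector")
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


# Hypothetical toy vectors, invented for illustration; a real model would
# learn several hundred dimensions from a large corpus.
TOY_VECTORS = {
    "protein": [0.9, 0.1, 0.0],
    "gene": [0.8, 0.2, 0.1],
    "binds": [0.1, 0.9, 0.0],
    "interacts": [0.2, 0.8, 0.1],
    "weather": [0.0, 0.1, 0.9],
}


def sentence_similarity(a, b, word_vectors=TOY_VECTORS):
    """Cosine similarity between the averaged vectors of two sentences."""
    vec_a = mean_vector(a.lower().split(), word_vectors)
    vec_b = mean_vector(b.lower().split(), word_vectors)
    return cosine_similarity(vec_a, vec_b)


if __name__ == "__main__":
    print(sentence_similarity("protein binds", "gene interacts"))
    print(sentence_similarity("protein binds", "weather"))
```

With these toy vectors, the first pair scores much higher than the second, which is the behavior a similarity model is expected to show for related versus unrelated sentences.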

RESULTS

Experimental results showed that, with a Pearson correlation of 0.819, our best unsupervised model based on the Paragraph Vector Distributed Memory algorithm outperforms previous state-of-the-art results achieved on the BIOSSES biomedical benchmark set. Moreover, our proposed supervised model that combines different string-based similarity metrics with a neural embedding model surpasses previous ontology-dependent supervised state-of-the-art approaches in terms of Pearson's r (r = 0.871) on the biomedical benchmark set. In contrast to the promising results for the original benchmark, we found our best models' performance on the smaller contradiction subset to be poor.
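The Pearson correlation reported above measures how closely a model's similarity scores track the human annotations across the sentence pairs. As a minimal sketch of that evaluation step, the function below computes Pearson's r from two score lists. The example scores are hypothetical placeholders, not data from the study.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys):
        raise ValueError("score lists must have the same length")
    if len(xs) < 2:
        raise ValueError("at least two pairs are required")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical values: one human rating and one model score per sentence pair.
    human_scores = [0.0, 1.0, 2.0, 3.0, 4.0]
    model_scores = [0.10, 0.30, 0.50, 0.60, 0.90]
    print(f"Pearson r = {pearson_r(human_scores, model_scores):.3f}")
```

Because Pearson's r is invariant to linear rescaling, the model scores do not need to be on the same scale as the human ratings.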

CONCLUSIONS

In this study, we have highlighted the value of neural network-based models for semantic similarity estimation in the biomedical domain by showing that they can keep up with and even surpass previous state-of-the-art approaches for semantic similarity estimation that depend on the availability of laboriously curated ontologies, when evaluated on a biomedical benchmark set. Capturing contradictions and negations in biomedical sentences, however, emerged as an essential area for further work.
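One plausible reason negation is hard can be seen with a simple surface-overlap measure. The sketch below is an illustration, not an analysis from the study: it uses Jaccard token overlap as an assumed example of a string-based metric, and the two sentences are hypothetical. A single "not" reverses the meaning yet barely changes the score.

```python
def jaccard_similarity(sentence_a, sentence_b):
    """Jaccard overlap between the lowercase token sets of two sentences."""
    tokens_a = set(sentence_a.lower().split())
    tokens_b = set(sentence_b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # convention chosen here: two empty sentences are identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


if __name__ == "__main__":
    # Hypothetical sentence pair that differs only by a negation.
    statement = "the mutation is associated with tumor growth"
    negation = "the mutation is not associated with tumor growth"
    print(jaccard_similarity(statement, negation))  # 7 shared / 8 total tokens = 0.875
```

Any measure driven mainly by shared words or shared contexts will tend to rate such contradictory pairs as highly similar, which is consistent with the weak results on the contradiction subset.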


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e295/6460644/8ed8be060a65/12859_2019_2789_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e295/6460644/2ff2ff8aac24/12859_2019_2789_Fig1_HTML.jpg

Similar articles

1
Neural sentence embedding models for semantic similarity estimation in the biomedical domain.
BMC Bioinformatics. 2019 Apr 11;20(1):178. doi: 10.1186/s12859-019-2789-2.
2
BIOSSES: a semantic sentence similarity estimation system for the biomedical domain.
Bioinformatics. 2017 Jul 15;33(14):i49-i58. doi: 10.1093/bioinformatics/btx238.
3
Fast and scalable neural embedding models for biomedical sentence classification.
BMC Bioinformatics. 2018 Dec 22;19(1):541. doi: 10.1186/s12859-018-2496-4.
4
Clinical Context-Aware Biomedical Text Summarization Using Deep Neural Network: Model Development and Validation.
J Med Internet Res. 2020 Oct 23;22(10):e19810. doi: 10.2196/19810.
5
Quantifying semantic similarity of clinical evidence in the biomedical literature to facilitate related evidence synthesis.
J Biomed Inform. 2019 Dec;100:103321. doi: 10.1016/j.jbi.2019.103321. Epub 2019 Oct 30.
6
The 2019 n2c2/OHNLP Track on Clinical Semantic Textual Similarity: Overview.
JMIR Med Inform. 2020 Nov 27;8(11):e23375. doi: 10.2196/23375.
7
Semantic relatedness and similarity of biomedical terms: examining the effects of recency, size, and section of biomedical publications on the performance of word2vec.
BMC Med Inform Decis Mak. 2017 Jul 3;17(1):95. doi: 10.1186/s12911-017-0498-1.
8
Predicting Semantic Similarity Between Clinical Sentence Pairs Using Transformer Models: Evaluation and Representational Analysis.
JMIR Med Inform. 2021 May 26;9(5):e23099. doi: 10.2196/23099.
9
A comparison of word embeddings for the biomedical natural language processing.
J Biomed Inform. 2018 Nov;87:12-20. doi: 10.1016/j.jbi.2018.09.008. Epub 2018 Sep 12.
10
Corpus domain effects on distributional semantic modeling of medical terms.
Bioinformatics. 2016 Dec 1;32(23):3635-3644. doi: 10.1093/bioinformatics/btw529. Epub 2016 Aug 16.

Cited by

1
Cognitive Computing-Based CDSS in Medical Practice.
Health Data Sci. 2021 Jul 22;2021:9819851. doi: 10.34133/2021/9819851. eCollection 2021.
2
Using language models and ontology topology to perform semantic mapping of traits between biomedical datasets.
Bioinformatics. 2023 Apr 3;39(4). doi: 10.1093/bioinformatics/btad169.
3
A reproducible experimental survey on biomedical sentence similarity: A string-based method sets the state of the art.
PLoS One. 2022 Nov 21;17(11):e0276539. doi: 10.1371/journal.pone.0276539. eCollection 2022.
4
HESML: a real-time semantic measures library for the biomedical domain with a reproducible survey.
BMC Bioinformatics. 2022 Jan 6;23(1):23. doi: 10.1186/s12859-021-04539-0.
5
Protocol for a reproducible experimental survey on biomedical sentence similarity.
PLoS One. 2021 Mar 24;16(3):e0248663. doi: 10.1371/journal.pone.0248663. eCollection 2021.

References cited in this article

1
BIOSSES: a semantic sentence similarity estimation system for the biomedical domain.
Bioinformatics. 2017 Jul 15;33(14):i49-i58. doi: 10.1093/bioinformatics/btx238.
2
Semantic similarity in the biomedical domain: an evaluation across knowledge sources.
BMC Bioinformatics. 2012 Oct 10;13:261. doi: 10.1186/1471-2105-13-261.