
Deep learning with sentence embeddings pre-trained on biomedical corpora improves the performance of finding similar sentences in electronic medical records.

Affiliations

National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, USA.

School of Biomedical Informatics, UTHealth, Houston, USA.

Publication information

BMC Med Inform Decis Mak. 2020 Apr 30;20(Suppl 1):73. doi: 10.1186/s12911-020-1044-0.

DOI: 10.1186/s12911-020-1044-0
PMID: 32349758
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7191680/
Abstract

BACKGROUND

Capturing sentence semantics plays a vital role in a range of text mining applications. Despite continuous efforts on the development of related datasets and models in the general domain, both datasets and models are limited in biomedical and clinical domains. The BioCreative/OHNLP2018 organizers have made the first attempt to annotate 1068 sentence pairs from clinical notes and have called for a community effort to tackle the Semantic Textual Similarity (BioCreative/OHNLP STS) challenge.

METHODS

We developed models using traditional machine learning and deep learning approaches. For the post challenge, we focused on two models: the Random Forest and the Encoder Network. We applied sentence embeddings pre-trained on PubMed abstracts and MIMIC-III clinical notes and updated the Random Forest and the Encoder Network accordingly.
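As a rough illustration of how pre-trained sentence embeddings can feed a feature-based model such as a Random Forest, the sketch below turns the two embeddings of a sentence pair into one feature vector. The function names and the choice of features (cosine similarity, element-wise absolute difference, element-wise product) are assumptions made for illustration; the abstract does not state which features the authors used.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def pair_features(emb_a, emb_b):
    """Hypothetical feature vector for one sentence pair.

    emb_a and emb_b stand in for sentence embeddings produced by some
    pre-trained encoder; the feature set here is an assumption, not the
    one reported in the paper.
    """
    if len(emb_a) != len(emb_b):
        raise ValueError("embeddings must have the same dimension")
    abs_diff = [abs(a - b) for a, b in zip(emb_a, emb_b)]
    product = [a * b for a, b in zip(emb_a, emb_b)]
    return [cosine_similarity(emb_a, emb_b)] + abs_diff + product
```

A vector built this way could be passed, one row per sentence pair, to any regressor that predicts the similarity score.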

RESULTS

The official results showed that our best submission was an ensemble of eight models. It achieved a Pearson correlation coefficient of 0.8328, the highest performance among 13 submissions from 4 teams. For the post challenge, the performance of both the Random Forest and the Encoder Network improved; in particular, the correlation of the Encoder Network improved by ~13%. During the challenge task, no end-to-end deep learning model performed better than machine learning models that take manually crafted features. In contrast, with sentence embeddings pre-trained on biomedical corpora, the Encoder Network now achieves a correlation of ~0.84, which is higher than the original best model. The ensemble model that takes the improved versions of the Random Forest and the Encoder Network as inputs further increased performance to 0.8528.
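The scores above are Pearson correlation coefficients between predicted and annotated similarity scores. A minimal sketch of that metric in plain Python follows; the function name `pearson` and its argument names are illustrative and are not taken from the challenge's evaluation script.

```python
import math


def pearson(predicted, gold):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(predicted) != len(gold) or len(predicted) < 2:
        raise ValueError("need two lists of the same length with at least 2 items")
    mean_p = sum(predicted) / len(predicted)
    mean_g = sum(gold) / len(gold)
    # Sums of centered products; the 1/n factors cancel in the ratio.
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(predicted, gold))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_g = sum((g - mean_g) ** 2 for g in gold)
    if var_p == 0.0 or var_g == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_p * var_g)
```

A value of 1.0 means the predicted scores rise and fall in exact linear step with the annotations, and 0 means no linear relationship.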

CONCLUSIONS

Deep learning models with sentence embeddings pre-trained on biomedical corpora achieve the highest performance on the test set. Through error analysis, we find that end-to-end deep learning models and traditional machine learning models with manually-crafted features complement each other by finding different types of sentences. We suggest a combination of these models can better find similar sentences in practice.
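One simple way to combine two complementary models is to average their predicted scores per sentence pair. The sketch below is a stand-in under that assumption: the abstract says only that the ensemble takes the Random Forest and Encoder Network as inputs, not how it combines them, and `average_ensemble` and its `weights` parameter are hypothetical names.

```python
def average_ensemble(predictions, weights=None):
    """Weighted mean of per-pair scores from several models.

    predictions: one list of scores per model, all of the same length.
    weights: optional per-model weights; defaults to equal weighting.
    """
    if not predictions:
        raise ValueError("need predictions from at least one model")
    n_pairs = len(predictions[0])
    if any(len(p) != n_pairs for p in predictions):
        raise ValueError("all models must score the same sentence pairs")
    if weights is None:
        weights = [1.0] * len(predictions)
    if len(weights) != len(predictions):
        raise ValueError("need exactly one weight per model")
    total = float(sum(weights))
    if total == 0.0:
        raise ValueError("weights must not sum to zero")
    return [
        sum(w * model[i] for w, model in zip(weights, predictions)) / total
        for i in range(n_pairs)
    ]
```

Averaging helps most when the models make different kinds of errors, which is the situation the error analysis describes.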


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/474f/7191680/90549b971b3e/12911_2020_1044_Fig1_HTML.jpg
Figure 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/474f/7191680/85400e1f2f66/12911_2020_1044_Fig2_HTML.jpg
Figure 3: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/474f/7191680/c367556aa60a/12911_2020_1044_Fig3_HTML.jpg
Figure 4: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/474f/7191680/64689af694f0/12911_2020_1044_Fig4_HTML.jpg
Figure 5: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/474f/7191680/b26f430954d3/12911_2020_1044_Fig5_HTML.jpg
