

BERT-based Ranking for Biomedical Entity Normalization

Authors

Ji Zongcheng, Wei Qiang, Xu Hua

Affiliation

School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA.

Publication

AMIA Jt Summits Transl Sci Proc. 2020 May 30;2020:269-277. eCollection 2020.

PMID: 32477646
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7233044/
Abstract

Developing high-performance entity normalization algorithms that can alleviate the term variation problem is of great interest to the biomedical community. Although deep learning-based methods have been successfully applied to biomedical entity normalization, they often depend on traditional context-independent word embeddings. Bidirectional Encoder Representations from Transformers (BERT), BERT for Biomedical Text Mining (BioBERT) and BERT for Clinical Text Mining (ClinicalBERT) were recently introduced to pre-train contextualized word representation models using bidirectional Transformers, advancing the state-of-the-art for many natural language processing tasks. In this study, we proposed an entity normalization architecture by fine-tuning the pre-trained BERT / BioBERT / ClinicalBERT models and conducted extensive experiments to evaluate the effectiveness of the pre-trained models for biomedical entity normalization using three different types of datasets. Our experimental results show that the best fine-tuned models consistently outperformed previous methods and advanced the state-of-the-art for biomedical entity normalization, with up to 1.17% increase in accuracy.
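The abstract describes normalization as a ranking problem: a fine-tuned BERT/BioBERT/ClinicalBERT model scores each (mention, candidate concept name) pair, and the top-scoring concept is returned. A minimal sketch of that rank-and-select step, with a toy string-similarity scorer standing in for the fine-tuned transformer; the concept IDs and names below are hypothetical:

```python
from difflib import SequenceMatcher

def score(mention, candidate):
    # Stand-in scorer: the paper fine-tunes BERT/BioBERT/ClinicalBERT to
    # score mention-candidate pairs; plain string similarity is used here
    # only to keep the sketch self-contained and runnable.
    return SequenceMatcher(None, mention.lower(), candidate.lower()).ratio()

def normalize(mention, ontology):
    # Rank every candidate concept name against the mention and return
    # the concept ID with the highest score.
    best_id, _ = max(ontology.items(), key=lambda kv: score(mention, kv[1]))
    return best_id

# Hypothetical mini-ontology mapping concept IDs to preferred names.
ontology = {
    "D003920": "Diabetes Mellitus",
    "D006973": "Hypertension",
}
print(normalize("hypertensive", ontology))  # prints "D006973"
```

In the paper itself, the scorer is a transformer fed the mention and candidate jointly, so the ranking benefits from contextualized representations rather than surface similarity.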


Similar Articles

1. BERT-based Ranking for Biomedical Entity Normalization.
   AMIA Jt Summits Transl Sci Proc. 2020 May 30;2020:269-277. eCollection 2020.
2. BioBERT: a pre-trained biomedical language representation model for biomedical text mining.
   Bioinformatics. 2020 Feb 15;36(4):1234-1240. doi: 10.1093/bioinformatics/btz682.
3. Fine-Tuning Bidirectional Encoder Representations From Transformers (BERT)-Based Models on Large-Scale Electronic Health Record Notes: An Empirical Study.
   JMIR Med Inform. 2019 Sep 12;7(3):e14830. doi: 10.2196/14830.
4. Extracting comprehensive clinical information for breast cancer using deep learning methods.
   Int J Med Inform. 2019 Dec;132:103985. doi: 10.1016/j.ijmedinf.2019.103985. Epub 2019 Oct 2.
5. Deep contextualized embeddings for quantifying the informative content in biomedical text summarization.
   Comput Methods Programs Biomed. 2020 Feb;184:105117. doi: 10.1016/j.cmpb.2019.105117. Epub 2019 Oct 4.
6. A Fine-Tuned Bidirectional Encoder Representations From Transformers Model for Food Named-Entity Recognition: Algorithm Development and Validation.
   J Med Internet Res. 2021 Aug 9;23(8):e28229. doi: 10.2196/28229.
7. BioBERT and Similar Approaches for Relation Extraction.
   Methods Mol Biol. 2022;2496:221-235. doi: 10.1007/978-1-0716-2305-3_12.
8. Relation Classification for Bleeding Events From Electronic Health Records Using Deep Learning Systems: An Empirical Study.
   JMIR Med Inform. 2021 Jul 2;9(7):e27527. doi: 10.2196/27527.
9. Drug knowledge discovery via multi-task learning and pre-trained models.
   BMC Med Inform Decis Mak. 2021 Nov 16;21(Suppl 9):251. doi: 10.1186/s12911-021-01614-7.
10. Identification of Semantically Similar Sentences in Clinical Notes: Iterative Intermediate Training Using Multi-Task Learning.
   JMIR Med Inform. 2020 Nov 27;8(11):e22508. doi: 10.2196/22508.

Cited By

1. Benchmarking Transformer Embedding Models for Biomedical Terminology Standardization.
   Mach Learn Appl. 2025 Sep;21. doi: 10.1016/j.mlwa.2025.100683. Epub 2025 Jun 5.
2. A Large Language Model Outperforms Other Computational Approaches to the High-Throughput Phenotyping of Physician Notes.
   AMIA Annu Symp Proc. 2025 May 22;2024:838-846. eCollection 2024.
3. Mapping Drug Terms via Integration of a Retrieval-Augmented Generation Algorithm with a Large Language Model.
   Healthc Inform Res. 2024 Oct;30(4):355-363. doi: 10.4258/hir.2024.30.4.355. Epub 2024 Oct 31.
4. Unsupervised SapBERT-based bi-encoders for medical concept annotation of clinical narratives with SNOMED CT.
   Digit Health. 2024 Oct 21;10:20552076241288681. doi: 10.1177/20552076241288681. eCollection 2024 Jan-Dec.
5. NSSC: a neuro-symbolic AI system for enhancing accuracy of named entity recognition and linking from oncologic clinical notes.
   Med Biol Eng Comput. 2025 Mar;63(3):749-772. doi: 10.1007/s11517-024-03227-4. Epub 2024 Nov 1.
6. CACER: Clinical concept Annotations for Cancer Events and Relations.
   J Am Med Inform Assoc. 2024 Nov 1;31(11):2583-2594. doi: 10.1093/jamia/ocae231.
7. Chemical entity normalization for successful translational development of Alzheimer's disease and dementia therapeutics.
   J Biomed Semantics. 2024 Jul 31;15(1):13. doi: 10.1186/s13326-024-00314-1.
8. Transformers and large language models in healthcare: A review.
   Artif Intell Med. 2024 Aug;154:102900. doi: 10.1016/j.artmed.2024.102900. Epub 2024 Jun 5.
9. Ways to make artificial intelligence work for healthcare professionals: correspondence.
   Antimicrob Steward Healthc Epidemiol. 2024 Jun 4;4(1):e95. doi: 10.1017/ash.2024.85. eCollection 2024.
10. NeighBERT: Medical Entity Linking Using Relation-Induced Dense Retrieval.
   J Healthc Inform Res. 2024 Jan 18;8(2):353-369. doi: 10.1007/s41666-023-00136-3. eCollection 2024 Jun.

References

1. Relation Extraction from Clinical Narratives Using Pre-trained Language Models.
   AMIA Annu Symp Proc. 2020 Mar 4;2019:1236-1245. eCollection 2019.
2. BioBERT: a pre-trained biomedical language representation model for biomedical text mining.
   Bioinformatics. 2020 Feb 15;36(4):1234-1240. doi: 10.1093/bioinformatics/btz682.
3. Enhancing clinical concept extraction with contextual embeddings.
   J Am Med Inform Assoc. 2019 Nov 1;26(11):1297-1304. doi: 10.1093/jamia/ocz096.
4. CLAMP - a toolkit for efficiently building customized clinical natural language processing pipelines.
   J Am Med Inform Assoc. 2018 Mar 1;25(3):331-336. doi: 10.1093/jamia/ocx132.
5. CNN-based ranking for biomedical entity normalization.
   BMC Bioinformatics. 2017 Oct 3;18(Suppl 11):385. doi: 10.1186/s12859-017-1805-7.
6. TaggerOne: joint named entity recognition and normalization with semi-Markov Models.
   Bioinformatics. 2016 Sep 15;32(18):2839-46. doi: 10.1093/bioinformatics/btw343. Epub 2016 Jun 9.
7. MIMIC-III, a freely accessible critical care database.
   Sci Data. 2016 May 24;3:160035. doi: 10.1038/sdata.2016.35.
8. Evaluating the state of the art in disorder recognition and normalization of the clinical narrative.
   J Am Med Inform Assoc. 2015 Jan;22(1):143-54. doi: 10.1136/amiajnl-2013-002544. Epub 2014 Aug 21.
9. Large-scale linear rankSVM.
   Neural Comput. 2014 Apr;26(4):781-817. doi: 10.1162/NECO_a_00571. Epub 2014 Jan 30.
10. NCBI disease corpus: a resource for disease name recognition and concept normalization.
   J Biomed Inform. 2014 Feb;47:1-10. doi: 10.1016/j.jbi.2013.12.006. Epub 2014 Jan 3.