A Combined Manual Annotation and Deep-Learning Natural Language Processing Study on Accurate Entity Extraction in Hereditary Disease Related Biomedical Literature.

Author information

BGI Research, Shenzhen, 518083, China.

Clinical Laboratory of BGI Health, BGI-Shenzhen, Shenzhen, 518083, China.

Publication information

Interdiscip Sci. 2024 Jun;16(2):333-344. doi: 10.1007/s12539-024-00605-2. Epub 2024 Feb 10.

DOI:10.1007/s12539-024-00605-2

PMID:38340264

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11289304/

Abstract

We report a study that combines manual annotation with deep-learning natural language processing to achieve accurate entity extraction from biomedical literature on hereditary diseases. A total of 400 full-text articles were manually annotated, following published guidelines, by experienced genetic interpreters at Beijing Genomics Institute (BGI). The quality of our manual annotations was assessed by comparing our re-annotated results with publicly available annotations. The overall Jaccard index was 0.866 across the four entity types: gene, variant, disease and species. A BERT-based large named entity recognition (NER) model and a DistilBERT-based simplified NER model were each trained, validated and tested. Because the manually annotated corpus is limited in size, these NER models were fine-tuned in two phases. The F1-scores of the BERT-based NER model for gene, variant, disease and species are 97.28%, 93.52%, 92.54% and 95.76%, respectively, while those of the DistilBERT-based NER model are 95.14%, 86.26%, 91.37% and 89.92%, respectively. Most importantly, the variant entity type has been extracted by a large language model for the first time, with an F1-score comparable to that of tmVar, the state-of-the-art variant extraction model.
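The abstract reports agreement as a Jaccard index and model quality as F1-scores. The sketch below illustrates how both metrics can be computed over sets of annotated entities. It is not the authors' evaluation code: the `Entity` record, the exact-span matching criterion, and the toy annotations are all assumptions made for illustration, since the abstract does not state how matches were counted.

```python
from typing import NamedTuple, Set, Tuple


class Entity(NamedTuple):
    """Hypothetical annotation record: one labeled text span in one document."""
    doc_id: str
    start: int   # character offset where the span begins
    end: int     # character offset where the span ends
    label: str   # "gene", "variant", "disease" or "species"


def jaccard_index(a: Set[Entity], b: Set[Entity]) -> float:
    """|A intersect B| / |A union B|; defined as 1.0 when both sets are empty."""
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)


def precision_recall_f1(gold: Set[Entity], predicted: Set[Entity]) -> Tuple[float, float, float]:
    """Entity-level scores, assuming a prediction counts only on exact match."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy annotations (invented for illustration, not data from the study).
ours = {
    Entity("doc1", 0, 5, "gene"),
    Entity("doc1", 20, 28, "variant"),
    Entity("doc1", 40, 52, "disease"),
}
public = {
    Entity("doc1", 0, 5, "gene"),
    Entity("doc1", 20, 28, "variant"),
    Entity("doc1", 60, 65, "species"),
}

agreement = jaccard_index(ours, public)
precision, recall, f1 = precision_recall_f1(gold=public, predicted=ours)
print(f"Jaccard = {agreement:.3f}, P = {precision:.3f}, R = {recall:.3f}, F1 = {f1:.3f}")
```

Per-type scores such as those quoted above would follow from filtering both sets by `label` before calling the same functions. A relaxed (overlap-based) matching rule would give different numbers, so the matching criterion should be taken from the full paper rather than from this sketch.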

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/52cd/11289304/0c9474a6f4cd/12539_2024_605_Fig1_HTML.jpg
