

Vocabulary Matters: An Annotation Pipeline and Four Deep Learning Algorithms for Enzyme Named Entity Recognition.

Affiliations

Section of Bioinformatics, Division of Systems Medicine, Department of Metabolism, Digestion and Reproduction, Imperial College London, London W12 0NN, U.K.

UKRI Centre for Doctoral Training in AI for Healthcare, Department of Computing, Imperial College London, London SW7 2AZ, U.K.

Publication Information

J Proteome Res. 2024 Jun 7;23(6):1915-1925. doi: 10.1021/acs.jproteome.3c00367. Epub 2024 May 11.

DOI: 10.1021/acs.jproteome.3c00367
PMID: 38733346
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11165580/
Abstract

Enzymes are indispensable in many biological processes, and with biomedical literature growing exponentially, effective literature review becomes increasingly challenging. Natural language processing methods offer solutions to streamline this process. This study aims to develop an annotated enzyme corpus for training and evaluating enzyme named entity recognition (NER) models. A novel pipeline, combining dictionary matching and rule-based keyword searching, automatically annotated enzyme entities in >4800 full-text publications. Four deep learning NER models were created with different vocabularies (BioBERT/SciBERT) and architectures (BiLSTM/transformer) and evaluated on 526 manually annotated full-text publications. The annotation pipeline achieved an F1-score of 0.86 (precision = 1.00, recall = 0.76), surpassed by fine-tuned transformers in F1-score (BioBERT: 0.89, SciBERT: 0.88) and recall (0.86), with BiLSTM models having higher precision (0.94) than transformers (0.92). The annotation pipeline runs in seconds on standard laptops with almost perfect precision, but was outperformed by fine-tuned transformers in terms of F1-score and recall, demonstrating generalizability beyond the training data. In comparison, SciBERT-based models exhibited higher precision, and BioBERT-based models exhibited higher recall, highlighting the importance of vocabulary and architecture. These models, representing the first enzyme NER algorithms, enable more effective enzyme text mining and information extraction. Code for automated annotation and model generation is available from https://github.com/omicsNLP/enzymeNER and https://zenodo.org/doi/10.5281/zenodo.10581586.
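The two ingredients of the annotation pipeline (dictionary matching plus a rule-based keyword search) and the reported metrics can be illustrated with a minimal sketch. This is not the authors' code: `ENZYME_DICT` is a toy stand-in for resources such as BRENDA, and the "-ase" suffix rule is a hypothetical example of a keyword rule. The F1 computation, however, reproduces the abstract's pipeline score from its stated precision and recall.

```python
# Minimal sketch (not the authors' implementation) of dictionary + rule-based
# enzyme annotation, and the F1-score relation used in the abstract.
import re

# Hypothetical mini-dictionary; the paper's pipeline draws on curated resources.
ENZYME_DICT = {"lactate dehydrogenase", "catalase", "trypsin"}

def annotate(text: str) -> list[str]:
    """Return enzyme mentions found by dictionary lookup or an '-ase' keyword rule."""
    lowered = text.lower()
    hits = {e for e in ENZYME_DICT if e in lowered}
    # Rule-based keyword search: standalone words ending in '-ase'.
    hits |= set(re.findall(r"\b\w+ase\b", lowered))
    return sorted(hits)

def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

print(annotate("Catalase and trypsin levels rose."))
# Pipeline numbers from the abstract: precision = 1.00, recall = 0.76 -> F1 = 0.86
print(round(f1(1.00, 0.76), 2))
```

With precision 1.00 and recall 0.76, the harmonic mean is 1.52/1.76 ≈ 0.86, matching the reported pipeline F1-score.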


Figures:
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8e4b/11165580/297127a0cad6/pr3c00367_0001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8e4b/11165580/9848d32087bd/pr3c00367_0002.jpg

Similar Articles

1. Vocabulary Matters: An Annotation Pipeline and Four Deep Learning Algorithms for Enzyme Named Entity Recognition.
J Proteome Res. 2024 Jun 7;23(6):1915-1925. doi: 10.1021/acs.jproteome.3c00367. Epub 2024 May 11.
2. MetaboListem and TABoLiSTM: Two Deep Learning Algorithms for Metabolite Named Entity Recognition.
Metabolites. 2022 Mar 22;12(4):276. doi: 10.3390/metabo12040276.
3. Improving dictionary-based named entity recognition with deep learning.
Bioinformatics. 2024 Sep 1;40(Suppl 2):ii45-ii52. doi: 10.1093/bioinformatics/btae402.
4. Improving entity recognition using ensembles of deep learning and fine-tuned large language models: A case study on adverse event extraction from VAERS and social media.
J Biomed Inform. 2025 Mar;163:104789. doi: 10.1016/j.jbi.2025.104789. Epub 2025 Feb 7.
5. A Combined Manual Annotation and Deep-Learning Natural Language Processing Study on Accurate Entity Extraction in Hereditary Disease Related Biomedical Literature.
Interdiscip Sci. 2024 Jun;16(2):333-344. doi: 10.1007/s12539-024-00605-2. Epub 2024 Feb 10.
6. From zero to hero: Harnessing transformers for biomedical named entity recognition in zero- and few-shot contexts.
Artif Intell Med. 2024 Oct;156:102970. doi: 10.1016/j.artmed.2024.102970. Epub 2024 Aug 24.
7. Transformers-sklearn: a toolkit for medical language understanding with transformer-based models.
BMC Med Inform Decis Mak. 2021 Jul 30;21(Suppl 2):90. doi: 10.1186/s12911-021-01459-0.
8. Evaluating Medical Entity Recognition in Health Care: Entity Model Quantitative Study.
JMIR Med Inform. 2024 Oct 17;12:e59782. doi: 10.2196/59782.
9. Lifestyle factors in the biomedical literature: an ontology and comprehensive resources for named entity recognition.
Bioinformatics. 2024 Nov 1;40(11). doi: 10.1093/bioinformatics/btae613.
10. Hybrid natural language processing tool for semantic annotation of medical texts in Spanish.
BMC Bioinformatics. 2025 Jan 8;26(1):7. doi: 10.1186/s12859-024-05949-6.

References Cited in This Article

1. BERN2: an advanced neural biomedical named entity recognition and normalization tool.
Bioinformatics. 2022 Oct 14;38(20):4837-4839. doi: 10.1093/bioinformatics/btac598.
2. MetaboListem and TABoLiSTM: Two Deep Learning Algorithms for Metabolite Named Entity Recognition.
Metabolites. 2022 Mar 22;12(4):276. doi: 10.3390/metabo12040276.
3. Auto-CORPus: A Natural Language Processing Tool for Standardizing and Reusing Biomedical Literature.
Front Digit Health. 2022 Feb 15;4:788124. doi: 10.3389/fdgth.2022.788124. eCollection 2022.
4. Biomedical named entity recognition using BERT in the machine reading comprehension framework.
J Biomed Inform. 2021 Jun;118:103799. doi: 10.1016/j.jbi.2021.103799. Epub 2021 May 6.
5. NLM-Chem, a new resource for chemical entity recognition in PubMed full text literature.
Sci Data. 2021 Mar 25;8(1):91. doi: 10.1038/s41597-021-00875-1.
6. BRENDA, the ELIXIR core data resource in 2021: new developments and updates.
Nucleic Acids Res. 2021 Jan 8;49(D1):D498-D508. doi: 10.1093/nar/gkaa1025.
7. TeamTat: a collaborative text annotation tool.
Nucleic Acids Res. 2020 Jul 2;48(W1):W5-W11. doi: 10.1093/nar/gkaa333.
8. Combinatorial feature embedding based on CNN and LSTM for biomedical named entity recognition.
J Biomed Inform. 2020 Mar;103:103381. doi: 10.1016/j.jbi.2020.103381. Epub 2020 Jan 28.
9. BioBERT: a pre-trained biomedical language representation model for biomedical text mining.
Bioinformatics. 2020 Feb 15;36(4):1234-1240. doi: 10.1093/bioinformatics/btz682.
10. CollaboNet: collaboration of deep neural networks for biomedical named entity recognition.
BMC Bioinformatics. 2019 May 29;20(Suppl 10):249. doi: 10.1186/s12859-019-2813-6.