Section of Bioinformatics, Division of Systems Medicine, Department of Metabolism, Digestion and Reproduction, Imperial College London, London W12 0NN, U.K.
UKRI Centre for Doctoral Training in AI for Healthcare, Department of Computing, Imperial College London, London SW7 2AZ, U.K.
J Proteome Res. 2024 Jun 7;23(6):1915-1925. doi: 10.1021/acs.jproteome.3c00367. Epub 2024 May 11.
Enzymes are indispensable in many biological processes, and with biomedical literature growing exponentially, effective literature review becomes increasingly challenging. Natural language processing methods offer solutions to streamline this process. This study aims to develop an annotated enzyme corpus for training and evaluating enzyme named entity recognition (NER) models. A novel pipeline combining dictionary matching and rule-based keyword searching automatically annotated enzyme entities in >4800 full-text publications. Four deep learning NER models were created with different vocabularies (BioBERT/SciBERT) and architectures (BiLSTM/transformer) and evaluated on 526 manually annotated full-text publications. The annotation pipeline achieved an F1-score of 0.86 (precision = 1.00, recall = 0.76), surpassed by the fine-tuned transformers in F1-score (BioBERT: 0.89, SciBERT: 0.88) and recall (0.86), while the BiLSTM models had higher precision (0.94) than the transformers (0.92). The annotation pipeline runs in seconds on a standard laptop with near-perfect precision, but the fine-tuned transformers outperformed it in F1-score and recall, demonstrating generalizability beyond the training data. In comparison, SciBERT-based models exhibited higher precision and BioBERT-based models higher recall, highlighting the importance of vocabulary and architecture. These models, representing the first enzyme NER algorithms, enable more effective enzyme text mining and information extraction. Code for automated annotation and model generation is available from https://github.com/omicsNLP/enzymeNER and https://zenodo.org/doi/10.5281/zenodo.10581586.
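To make the two methodological ideas in the abstract concrete, the sketch below illustrates (a) a toy dictionary-plus-keyword-rule annotator in the spirit of the described pipeline, and (b) how the reported F1-score follows from the stated precision and recall. The enzyme dictionary, the "-ase" suffix rule, and the function names are hypothetical stand-ins, not the authors' actual implementation (which is available at the linked repository).

```python
import re

# Hypothetical stand-in for a curated enzyme dictionary.
ENZYME_DICTIONARY = {"trypsin", "lysozyme", "dna polymerase"}

# Illustrative keyword rule: tokens ending in "-ase" are common enzyme-name candidates.
ASE_RULE = re.compile(r"\b[A-Za-z-]+ase\b")

def annotate_enzymes(text: str) -> list:
    """Return sorted (start, end, surface form) spans for candidate enzyme mentions."""
    spans = set()
    lowered = text.lower()
    # Dictionary matching: exact, case-insensitive lookup of known enzyme names.
    for name in ENZYME_DICTIONARY:
        for m in re.finditer(re.escape(name), lowered):
            spans.add((m.start(), m.end(), text[m.start():m.end()]))
    # Rule-based keyword searching: suffix pattern for unseen enzyme names.
    for m in ASE_RULE.finditer(text):
        spans.add((m.start(), m.end(), m.group()))
    return sorted(spans)

def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

spans = annotate_enzymes("Trypsin cleaves peptides; a kinase phosphorylates them.")
print([s[2] for s in spans])        # ['Trypsin', 'kinase']
print(round(f1_score(1.00, 0.76), 2))  # 0.86, matching the reported pipeline F1-score
```

Note how "kinase" is caught only by the suffix rule, while "Trypsin" is caught only by the dictionary: combining both sources is what lets such a pipeline trade a little recall for very high precision.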