National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, 20894, USA.
Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211, Geneva, 4, Switzerland.
Sci Data. 2024 Sep 9;11(1):982. doi: 10.1038/s41597-024-03835-7.
Expert curation is essential to capture knowledge of enzyme functions from the scientific literature in FAIR open knowledgebases, but it cannot keep pace with the rate of new discoveries and new publications. In this work we present EnzChemRED (Enzyme Chemistry Relation Extraction Dataset), a new training and benchmarking dataset to support the development of Natural Language Processing (NLP) methods, such as (large) language models, that can assist enzyme curation. EnzChemRED consists of 1,210 expert-curated PubMed abstracts in which enzymes and the chemical reactions they catalyze are annotated using identifiers from the protein knowledgebase UniProtKB and the chemical ontology ChEBI. We show that fine-tuning language models with EnzChemRED significantly boosts their ability to identify proteins and chemicals in text (86.30% F score), to extract chemical conversions (86.66% F score), and to identify the enzymes that catalyze those conversions (83.79% F score). We apply our methods to abstracts at PubMed scale to create a draft map of enzyme functions described in the literature, to guide curation efforts in UniProtKB and the reaction knowledgebase Rhea.