National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, 20894, USA.
Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211, Geneva, 4, Switzerland.
Sci Data. 2024 Sep 9;11(1):982. doi: 10.1038/s41597-024-03835-7.
Expert curation is essential to capture knowledge of enzyme functions from the scientific literature in FAIR open knowledgebases, but it cannot keep pace with the rate of new discoveries and new publications. In this work we present EnzChemRED (Enzyme Chemistry Relation Extraction Dataset), a new training and benchmarking dataset to support the development of Natural Language Processing (NLP) methods, such as (large) language models, that can assist enzyme curation. EnzChemRED consists of 1,210 expert-curated PubMed abstracts in which enzymes and the chemical reactions they catalyze are annotated using identifiers from the protein knowledgebase UniProtKB and the chemical ontology ChEBI. We show that fine-tuning language models with EnzChemRED significantly boosts their ability to identify proteins and chemicals in text (86.30% F score), to extract chemical conversions (86.66% F score), and to identify the enzymes that catalyze those conversions (83.79% F score). We apply our methods to abstracts at PubMed scale to create a draft map of enzyme functions described in the literature, to guide curation efforts in UniProtKB and the reaction knowledgebase Rhea.