The biomedical relationship corpus of the BioRED track at the BioCreative VIII challenge and workshop.

Affiliations

National Library of Medicine (NLM), National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20894, United States.

School of Computer Science and Technology, Dalian University of Technology, No.2 Linggong Road, Ganjingzi District, Dalian, Liaoning 116024, China.

Publication information

Database (Oxford). 2024 Aug 9;2024. doi: 10.1093/database/baae071.

DOI: 10.1093/database/baae071
PMID: 39126204
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11315767/
Abstract

The automatic recognition of biomedical relationships is an important step in the semantic understanding of the information contained in the unstructured text of the published literature. The BioRED track at BioCreative VIII aimed to foster the development of such methods by providing participants with the BioRED-BC8 corpus, a collection of 1000 PubMed documents manually curated for diseases, genes/proteins, chemicals, cell lines, gene variants, and species, as well as pairwise relationships between them: disease-gene, chemical-gene, disease-variant, gene-gene, chemical-disease, chemical-chemical, chemical-variant, and variant-variant. Furthermore, relationships are categorized into the following semantic categories: positive correlation, negative correlation, binding, conversion, drug interaction, comparison, cotreatment, and association. Unlike most of the previous publicly available corpora, all relationships are expressed at the document level as opposed to the sentence level, and as such, the entities are normalized to the corresponding concept identifiers of the standardized vocabularies: diseases and chemicals are normalized to MeSH, genes (and proteins) to National Center for Biotechnology Information (NCBI) Gene, species to NCBI Taxonomy, cell lines to Cellosaurus, and gene/protein variants to the Single Nucleotide Polymorphism Database (dbSNP). Finally, each annotated relationship is categorized as 'novel' or not, depending on whether it is a novel finding or experimental verification in the publication in which it is expressed. This distinction helps differentiate novel findings from other relationships in the same text that provide known facts and/or background knowledge. The BioRED-BC8 corpus uses the previous BioRED corpus of 600 PubMed articles as the training dataset and includes a set of 400 newly published articles to serve as the test data for the challenge. All test articles were manually annotated for the BioCreative VIII challenge by expert biocurators at the National Library of Medicine, using the original annotation guidelines, with each article doubly annotated in a three-round annotation process until full agreement is reached between all curators. This manuscript details the characteristics of the BioRED-BC8 corpus as a critical resource for biomedical named entity recognition and relation extraction. Using this new resource, we have demonstrated advancements in biomedical text-mining algorithm development. Database URL: https://codalab.lisn.upsaclay.fr/competitions/16381.
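The annotation scheme summarized in the abstract can be pictured as a small data model. The sketch below is illustrative only and is not the corpus's actual file format or any official API: the class names (`EntityType`, `RelationType`, `Entity`, `Relation`), the `VOCABULARY` and `ALLOWED_PAIRS` tables, and the placeholder identifiers are all assumptions introduced here. Only the entity types, relation pairs, semantic categories, and target vocabularies are taken from the abstract.

```python
from dataclasses import dataclass
from enum import Enum


class EntityType(Enum):
    DISEASE = "disease"
    GENE = "gene"  # covers genes and proteins
    CHEMICAL = "chemical"
    CELL_LINE = "cell line"
    VARIANT = "variant"
    SPECIES = "species"


class RelationType(Enum):
    POSITIVE_CORRELATION = "positive correlation"
    NEGATIVE_CORRELATION = "negative correlation"
    BINDING = "binding"
    CONVERSION = "conversion"
    DRUG_INTERACTION = "drug interaction"
    COMPARISON = "comparison"
    COTREATMENT = "cotreatment"
    ASSOCIATION = "association"


# Vocabulary each entity type is normalized to, as listed in the abstract.
VOCABULARY = {
    EntityType.DISEASE: "MeSH",
    EntityType.CHEMICAL: "MeSH",
    EntityType.GENE: "NCBI Gene",
    EntityType.SPECIES: "NCBI Taxonomy",
    EntityType.CELL_LINE: "Cellosaurus",
    EntityType.VARIANT: "dbSNP",
}

# The eight entity-type pairs named in the abstract. frozenset makes the
# pair order-insensitive; a same-type pair collapses to a one-element set.
ALLOWED_PAIRS = {
    frozenset({EntityType.DISEASE, EntityType.GENE}),
    frozenset({EntityType.CHEMICAL, EntityType.GENE}),
    frozenset({EntityType.DISEASE, EntityType.VARIANT}),
    frozenset({EntityType.GENE}),  # gene-gene
    frozenset({EntityType.CHEMICAL, EntityType.DISEASE}),
    frozenset({EntityType.CHEMICAL}),  # chemical-chemical
    frozenset({EntityType.CHEMICAL, EntityType.VARIANT}),
    frozenset({EntityType.VARIANT}),  # variant-variant
}


@dataclass(frozen=True)
class Entity:
    type: EntityType
    concept_id: str  # identifier in VOCABULARY[type]; hypothetical values here


@dataclass(frozen=True)
class Relation:
    """A document-level relation between two normalized concepts."""

    pmid: str  # relations attach to a whole document, not to a sentence
    head: Entity
    tail: Entity
    type: RelationType
    novel: bool  # True for a novel finding or experimental verification

    def __post_init__(self):
        pair = frozenset({self.head.type, self.tail.type})
        if pair not in ALLOWED_PAIRS:
            raise ValueError(
                f"unsupported pair: {self.head.type.value}-{self.tail.type.value}"
            )


# Usage with placeholder identifiers (not real corpus annotations).
example = Relation(
    pmid="PLACEHOLDER_PMID",
    head=Entity(EntityType.CHEMICAL, "PLACEHOLDER_CHEMICAL_ID"),
    tail=Entity(EntityType.DISEASE, "PLACEHOLDER_DISEASE_ID"),
    type=RelationType.POSITIVE_CORRELATION,
    novel=True,
)
```

Keying a relation on concept identifiers rather than text spans reflects the document-level design described above: the same pair can be mentioned in several sentences yet is recorded once per document.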
