National Library of Medicine (NLM), National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20894, United States.
School of Computer Science and Technology, Dalian University of Technology, No.2 Linggong Road, Ganjingzi District, Dalian, Liaoning 116024, China.
Database (Oxford). 2024 Aug 9;2024. doi: 10.1093/database/baae071.
The automatic recognition of biomedical relationships is an important step in the semantic understanding of the information contained in the unstructured text of the published literature. The BioRED track at BioCreative VIII aimed to foster the development of such methods by providing participants with the BioRED-BC8 corpus, a collection of 1000 PubMed documents manually curated for diseases, genes/proteins, chemicals, cell lines, gene variants, and species, as well as for the pairwise relationships between them: disease-gene, chemical-gene, disease-variant, gene-gene, chemical-disease, chemical-chemical, chemical-variant, and variant-variant. Furthermore, relationships are assigned to one of the following semantic categories: positive correlation, negative correlation, binding, conversion, drug interaction, comparison, cotreatment, and association. Unlike in most previous publicly available corpora, all relationships are expressed at the document level rather than the sentence level; accordingly, the entities are normalized to the corresponding concept identifiers of standardized vocabularies: diseases and chemicals are normalized to MeSH, genes (and proteins) to National Center for Biotechnology Information (NCBI) Gene, species to NCBI Taxonomy, cell lines to Cellosaurus, and gene/protein variants to the Single Nucleotide Polymorphism Database (dbSNP). Finally, each annotated relationship is flagged as 'novel' when it is a novel finding or an experimental verification in the publication in which it is expressed. This flag helps distinguish novel findings from other relationships in the same text that provide known facts and/or background knowledge. The BioRED-BC8 corpus uses the previous BioRED corpus of 600 PubMed articles as the training dataset and adds a set of 400 newly published articles to serve as the test data for the challenge.
All test articles were manually annotated for the BioCreative VIII challenge by expert biocurators at the National Library of Medicine using the original annotation guidelines; each article was doubly annotated in a three-round annotation process until all curators reached full agreement. This manuscript details the characteristics of the BioRED-BC8 corpus as a critical resource for biomedical named entity recognition and relation extraction. Using this new resource, we have demonstrated advancements in biomedical text-mining algorithm development. Database URL: https://codalab.lisn.upsaclay.fr/competitions/16381.
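The annotation scheme described in the abstract (six entity types, eight entity-type pairs, eight semantic categories, and a novelty flag, all at the document level) can be illustrated with a minimal Python sketch. This is not the corpus's actual file format or any official loader: the class names, field names, lowercase labels, and the `validate` helper are hypothetical, and the identifiers in the usage lines are placeholders rather than real corpus entries.

```python
from __future__ import annotations

from dataclasses import dataclass

# Entity types listed in the abstract (label spellings are an assumption).
ENTITY_TYPES = {"disease", "gene", "chemical", "cell_line", "variant", "species"}

# Unordered entity-type pairs listed in the abstract; a single-element
# frozenset stands for a same-type pair such as gene-gene.
RELATION_PAIRS = {
    frozenset({"disease", "gene"}),
    frozenset({"chemical", "gene"}),
    frozenset({"disease", "variant"}),
    frozenset({"gene"}),
    frozenset({"chemical", "disease"}),
    frozenset({"chemical"}),
    frozenset({"chemical", "variant"}),
    frozenset({"variant"}),
}

# Semantic categories listed in the abstract.
RELATION_TYPES = {
    "positive_correlation", "negative_correlation", "binding", "conversion",
    "drug_interaction", "comparison", "cotreatment", "association",
}


@dataclass(frozen=True)
class Entity:
    identifier: str   # normalized concept identifier (placeholder format)
    entity_type: str  # one of ENTITY_TYPES


@dataclass(frozen=True)
class Relation:
    pmid: str           # document-level: tied to an article, not a sentence
    first: Entity
    second: Entity
    relation_type: str  # one of RELATION_TYPES
    novel: bool         # True for a novel finding or experimental verification


def validate(rel: Relation) -> list[str]:
    """Return a list of problems; an empty list means the relation fits the scheme."""
    problems = []
    for ent in (rel.first, rel.second):
        if ent.entity_type not in ENTITY_TYPES:
            problems.append(f"unknown entity type: {ent.entity_type}")
    pair = frozenset({rel.first.entity_type, rel.second.entity_type})
    if pair not in RELATION_PAIRS:
        problems.append(f"entity pair not in scheme: {sorted(pair)}")
    if rel.relation_type not in RELATION_TYPES:
        problems.append(f"unknown relation type: {rel.relation_type}")
    return problems


# Usage with placeholder identifiers (not taken from the corpus).
chem = Entity("CHEMICAL_ID_PLACEHOLDER", "chemical")
dis = Entity("DISEASE_ID_PLACEHOLDER", "disease")
ok = Relation("PMID_PLACEHOLDER", chem, dis, "negative_correlation", novel=True)
print(validate(ok))
```

Using unordered sets for the type pairs reflects the assumption that a chemical-disease relation and a disease-chemical relation are the same pair; whether individual categories are directional is not stated in the abstract.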