Islamaj Dogan Rezarta, Kim Sun, Chatr-Aryamontri Andrew, Chang Christie S, Oughtred Rose, Rust Jennifer, Wilbur W John, Comeau Donald C, Dolinski Kara, Tyers Mike
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA.
Institute for Research in Immunology and Cancer, Université de Montréal, Montréal, QC H3C 3J7, Canada.
Database (Oxford). 2017 Jan 10;2017. doi: 10.1093/database/baw147. Print 2017.
A great deal of information on the molecular genetics and biochemistry of model organisms has been reported in the scientific literature. However, this data is typically described in free-text form and is not readily amenable to computational analysis. To address this, the BioGRID database systematically curates the biomedical literature for genetic and protein interaction data. This data is provided in a standardized, computationally tractable format and includes structured annotation of experimental evidence. BioGRID curation necessarily involves substantial human effort by expert curators, who must read each publication to extract the relevant information. Computational text-mining methods offer the potential to augment and accelerate manual curation. To facilitate the development of practical text-mining strategies, a new challenge was organized in BioCreative V for the BioC task, the collaborative Biocurator Assistant Task. This was a non-competitive, cooperative task in which the participants worked together to build BioC-compatible modules into an integrated pipeline to assist BioGRID curators. As an integral part of this task, a test collection of full-text articles was developed that contained both biological entity annotations (gene/protein and organism/species) and molecular interaction annotations (protein-protein interactions (PPIs) and genetic interactions (GIs)). This collection, which we call the BioC-BioGRID corpus, was annotated by four BioGRID curators over three rounds of annotation and contains 120 full-text articles curated in a dataset representing two major model organisms, namely budding yeast and human. The BioC-BioGRID corpus contains annotations for 6409 mentions of genes and their Entrez Gene IDs, 186 mentions of organism names and their NCBI Taxonomy IDs, 1867 mentions of PPIs and 701 annotations of PPI experimental evidence statements, and 856 mentions of GIs and 399 annotations of GI evidence statements.
The purpose, characteristics and possible future uses of the BioC-BioGRID corpus are detailed in this report.
Database URL: http://bioc.sourceforge.net/BioC-BioGRID.html.