Bada Michael, Vasilevsky Nicole, Baumgartner William A, Haendel Melissa, Hunter Lawrence E
School of Medicine, Department of Pharmacology, University of Colorado Anschutz Medical Campus, 12801 E. 17th Ave., P.O. Box 6511, MS 8303, Aurora, CO 80045-0511, USA.
Ontology Development Group, Library, Oregon Health & Science University, 318 SW Sam Jackson Park Road, Portland, OR 97239, USA.
Database (Oxford). 2017 Jan 1;2017. doi: 10.1093/database/bax087.
Gold-standard annotated corpora have become important resources for the training and testing of natural-language-processing (NLP) systems designed to support biocuration efforts, and ontologies are increasingly used to facilitate curational consistency and semantic integration across disparate resources. Bringing together the respective power of these, the Colorado Richly Annotated Full-Text (CRAFT) Corpus, a collection of full-length, open-access biomedical journal articles with extensive manually created syntactic, formatting and semantic markup, was previously created and released. This initial public release has already been used in multiple projects to drive development of systems focused on a variety of biocuration, search, visualization, and semantic and syntactic NLP tasks. Building on its demonstrated utility, we have expanded the CRAFT Corpus with a large set of manually created semantic annotations relying on Uberon, an ontology representing anatomical entities and life-cycle stages of multicellular organisms across species as well as types of multicellular organisms defined in terms of life-cycle stage and sexual characteristics. This newly created set of annotations, which has been added for v2.1 of the corpus, is by far the largest publicly available collection of gold-standard anatomical markup and is the first large-scale effort at manual markup of biomedical text relying on the entirety of an anatomical terminology, as opposed to annotation with a small number of high-level anatomical categories, as performed in previous corpora. In addition to presenting and discussing this newly available resource, we apply it to provide a performance baseline for the automatic annotation of anatomical concepts in biomedical text using a prominent concept recognition system. The full corpus, released with a CC BY 3.0 license, may be downloaded from http://bionlp-corpora.sourceforge.net/CRAFT/index.shtml. 
Database URL: http://bionlp-corpora.sourceforge.net/CRAFT/index.shtml.