Li Jiao, Sun Yueping, Johnson Robin J, Sciaky Daniela, Wei Chih-Hsuan, Leaman Robert, Davis Allan Peter, Mattingly Carolyn J, Wiegers Thomas C, Lu Zhiyong
1. Institute of Medical Information, Chinese Academy of Medical Sciences, Beijing 100020, China.
2. Department of Biological Sciences and the Center for Human Health and the Environment, North Carolina State University, Raleigh, NC 27695, USA.
Database (Oxford). 2016 May 9;2016. doi: 10.1093/database/baw068. Print 2016.
Community-run formal evaluations and manually annotated text corpora are critically important for advancing biomedical text-mining research. BioCreative V recently organized a new challenge on two tasks: disease named entity recognition (DNER) and chemical-induced disease (CID) relation extraction. Both tasks require a test collection that contains disease/chemical annotations and relation annotations in the same set of articles. Despite previous efforts in biomedical corpus construction, no existing corpus was found to be sufficient for the task. We therefore developed our own corpus, BC5CDR, during the challenge by inviting a team of Medical Subject Headings (MeSH) indexers to annotate disease/chemical entities and Comparative Toxicogenomics Database (CTD) curators to annotate CID relations. To ensure high annotation quality and productivity, detailed annotation guidelines and automatic annotation tools were provided. The resulting BC5CDR corpus consists of 1500 PubMed articles with 4409 annotated chemicals, 5818 diseases and 3116 chemical-disease interactions. Each entity annotation includes both the mention text span and a normalized concept identifier, using MeSH as the controlled vocabulary. To ensure accuracy, the entities were first captured independently by two annotators, followed by a consensus annotation. The average inter-annotator agreement (IAA) scores in the test set were 87.49% for diseases and 96.05% for chemicals, according to the Jaccard similarity coefficient. Our corpus was successfully used for the BioCreative V challenge tasks and should serve as a valuable resource for the text-mining research community.
Database URL: http://www.biocreative.org/tasks/biocreative-v/track-3-cdr/
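To make the agreement measure concrete, the following is a minimal sketch of a Jaccard similarity computation between two annotators' entity annotations for one article. The abstract does not state the exact matching criterion, so this sketch assumes that two annotations agree only when their character offsets and concept identifier are identical. The function name `jaccard`, the tuple layout, and the offsets and MeSH-style identifiers in the example are all illustrative assumptions, not data from the corpus.

```python
from typing import Iterable, Tuple

# Assumed annotation layout: (start offset, end offset, concept identifier).
Annotation = Tuple[int, int, str]


def jaccard(a: Iterable[Annotation], b: Iterable[Annotation]) -> float:
    """Return |A intersect B| / |A union B| for two sets of annotations."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Neither annotator marked anything: treat as full agreement.
        return 1.0
    return len(set_a & set_b) / len(union)


# Hypothetical annotations for a single article (values are made up).
annotator_1 = [(0, 8, "D000001"), (25, 36, "D000002"), (50, 62, "D000003")]
annotator_2 = [(0, 8, "D000001"), (25, 36, "D000002"), (70, 81, "D000004")]

# Two shared annotations out of four distinct ones.
score = jaccard(annotator_1, annotator_2)
print(f"Jaccard agreement: {score:.2%}")
```

Under this assumption, a corpus-level figure such as the ones reported above would be an average of per-article (or per-annotator-pair) scores computed separately for diseases and for chemicals. A looser criterion, such as matching on overlapping spans, would give higher scores.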