Computational Bioscience Program, University of Colorado School of Medicine, MS 8303, 12801 E 17th Ave, Aurora, CO 80045, USA.
BMC Bioinformatics. 2012 Aug 17;13:207. doi: 10.1186/1471-2105-13-207.
We introduce the linguistic annotation of a corpus of 97 full-text biomedical publications, known as the Colorado Richly Annotated Full Text (CRAFT) corpus. We further assess the performance of existing tools for performing sentence splitting, tokenization, syntactic parsing, and named entity recognition on this corpus.
Many biomedical natural language processing systems demonstrated large differences between their previously published results and their performance on the CRAFT corpus when tested with the publicly available models or rule sets. Trainable systems differed widely with respect to their ability to build high-performing models based on this data.
The finding that some systems were able to train high-performing models on this corpus is additional evidence, beyond high inter-annotator agreement, that the quality of the CRAFT corpus is high. The overall poor performance of various systems indicates that considerable work is still needed before natural language processing systems work well on full-text journal articles. The CRAFT corpus provides the biomedical natural language processing community with a valuable resource for evaluating systems and training new models on biomedical full-text publications.