Oellrich Anika, Collier Nigel, Smedley Damian, Groza Tudor
Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, CB10 1SA, United Kingdom.
EMBL European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, CB10 1SD, United Kingdom; National Institute of Informatics, Tokyo 101-8430, Japan.
PLoS One. 2015 Jan 21;10(1):e0116040. doi: 10.1371/journal.pone.0116040. eCollection 2015.
Electronic health records and scientific articles possess differing linguistic characteristics that may impact the performance of natural language processing tools developed for one or the other. In this paper, we investigate the performance of four existing concept recognition tools: the clinical Text Analysis and Knowledge Extraction System (cTAKES), the National Center for Biomedical Ontology (NCBO) Annotator, the Biomedical Concept Annotation System (BeCAS) and MetaMap. Each of the four concept recognition systems is applied to four different corpora: the i2b2 corpus of clinical documents, a PubMed corpus of Medline abstracts, a clinical trials corpus and the ShARe/CLEF corpus. In addition, we assess the performance of each system against the gold standard annotation set available for the ShARe/CLEF corpus. Furthermore, we build a silver standard annotation set from the individual systems' outputs and assess both its quality and the contribution of each system to that quality. Our results demonstrate that it is mainly the NCBO Annotator and cTAKES that contribute to the silver standard corpora (F1-measures in the range of 21% to 74%) and to their quality (best F1-measure of 33%), independently of the type of text investigated. While BeCAS and MetaMap can contribute to the precision of silver standard annotations (precision of up to 42%), the F1-measure drops when they are combined with the NCBO Annotator and cTAKES, owing to their low recall. In conclusion, the performance of the individual systems needs to be improved independently of text type, and the strategies for best leveraging the individual systems' annotations need to be revised. The textual content of the PubMed corpus, accession numbers for the clinical trials corpus, and the annotations assigned by the four concept recognition systems, as well as the generated silver standard annotation sets, are available from http://purl.org/phenotype/resources. The textual content of the ShARe/CLEF (https://sites.google.com/site/shareclefehealth/data) and i2b2 (https://i2b2.org/NLP/DataSets/) corpora must be requested from the individual corpus providers.
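The abstract does not specify how the individual systems' annotations were harmonised into the silver standard; a minimal sketch, assuming a simple majority vote over span-level annotations and exact-match precision/recall/F1 scoring against a gold set, is shown below. The system outputs, the gold annotations and the concept identifiers are hypothetical illustrations, not data from the paper.

```python
from collections import Counter

# Hypothetical span-level annotations per system: (start, end, concept_id).
# Real tools such as cTAKES or MetaMap emit richer output structures.
system_outputs = {
    "cTAKES":  {(0, 7, "C0011849"), (12, 20, "C0020538")},
    "NCBO":    {(0, 7, "C0011849"), (25, 33, "C0027051")},
    "BeCAS":   {(0, 7, "C0011849")},
    "MetaMap": {(12, 20, "C0020538"), (25, 33, "C0027051")},
}

def build_silver_standard(outputs, min_votes=2):
    """Keep annotations proposed by at least `min_votes` systems."""
    votes = Counter(ann for anns in outputs.values() for ann in anns)
    return {ann for ann, n in votes.items() if n >= min_votes}

def precision_recall_f1(predicted, gold):
    """Exact-match span evaluation against a gold annotation set."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

silver = build_silver_standard(system_outputs)
gold = {(0, 7, "C0011849"), (12, 20, "C0020538")}  # hypothetical gold set
print(precision_recall_f1(silver, gold))
```

Raising `min_votes` trades recall for precision, which mirrors the abstract's observation that adding low-recall systems to the vote can lower the combined F1-measure.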