A corpus of full-text journal articles is a robust evaluation tool for revealing differences in performance of biomedical natural language processing tools.

Affiliation

Computational Bioscience Program, University of Colorado School of Medicine, MS 8303, 12801 E 17th Ave, Aurora, CO 80045, USA.

Publication information

BMC Bioinformatics. 2012 Aug 17;13:207. doi: 10.1186/1471-2105-13-207.

DOI:10.1186/1471-2105-13-207

PMID:22901054

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3483229/

Abstract

BACKGROUND

We introduce the linguistic annotation of a corpus of 97 full-text biomedical publications, known as the Colorado Richly Annotated Full Text (CRAFT) corpus. We further assess the performance of existing tools for performing sentence splitting, tokenization, syntactic parsing, and named entity recognition on this corpus.

RESULTS

Many biomedical natural language processing systems demonstrated large differences between their previously published results and their performance on the CRAFT corpus when tested with the publicly available models or rule sets. Trainable systems differed widely with respect to their ability to build high-performing models based on this data.

CONCLUSIONS

The finding that some systems were able to train high-performing models based on this corpus is additional evidence, beyond high inter-annotator agreement, that the quality of the CRAFT corpus is high. The overall poor performance of various systems indicates that considerable work needs to be done to enable natural language processing systems to work well when the input is full-text journal articles. The CRAFT corpus provides a valuable resource to the biomedical natural language processing community for evaluation and training of new models for biomedical full-text publications.
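The kind of tool-versus-gold comparison the abstract describes can be illustrated with a minimal sketch. It assumes exact-match scoring over character-offset spans, which is one common convention and not necessarily the scoring protocol used in the paper. The function names, the example sentence, and the gold token boundaries are all hypothetical and are not taken from the CRAFT corpus.

```python
import re
from typing import Iterable, List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def span_prf(gold: Iterable[Span], predicted: Iterable[Span]) -> Tuple[float, float, float]:
    """Exact-match precision, recall, and F1 over two collections of spans."""
    gold_set = set(gold)
    pred_set = set(predicted)
    true_pos = len(gold_set & pred_set)
    precision = true_pos / len(pred_set) if pred_set else 0.0
    recall = true_pos / len(gold_set) if gold_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def whitespace_token_spans(text: str) -> List[Span]:
    """A deliberately naive baseline tokenizer: split on whitespace only."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


# Hypothetical sentence and hand-made gold tokenization (illustration only).
text = "IL-2 binds (weakly) to CD25."
gold_tokens = [(0, 4), (5, 10), (11, 12), (12, 18), (18, 19), (20, 22), (23, 27), (27, 28)]

precision, recall, f1 = span_prf(gold_tokens, whitespace_token_spans(text))
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

In this toy case the whitespace baseline fails on the parentheses and the sentence-final period. The same span-matching scheme could in principle be applied to sentence boundaries or entity mentions by changing what the spans represent.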

Similar articles

Coreference annotation and resolution in the Colorado Richly Annotated Full Text (CRAFT) corpus of biomedical journal articles.

BMC Bioinformatics. 2017 Aug 17;18(1):372. doi: 10.1186/s12859-017-1775-9.

Concept annotation in the CRAFT corpus.

BMC Bioinformatics. 2012 Jul 9;13:161. doi: 10.1186/1471-2105-13-161.

Biomedical and clinical English model packages for the Stanza Python NLP library.

J Am Med Inform Assoc. 2021 Aug 13;28(9):1892-1899. doi: 10.1093/jamia/ocab090.

Building a comprehensive syntactic and semantic corpus of Chinese clinical texts.

J Biomed Inform. 2017 May;69:203-217. doi: 10.1016/j.jbi.2017.04.006. Epub 2017 Apr 9.

The biomedical discourse relation bank.

BMC Bioinformatics. 2011 May 23;12:188. doi: 10.1186/1471-2105-12-188.

Gold-standard ontology-based anatomical annotation in the CRAFT Corpus.

Database (Oxford). 2017 Jan 1;2017. doi: 10.1093/database/bax087.

NCBI disease corpus: a resource for disease name recognition and concept normalization.

J Biomed Inform. 2014 Feb;47:1-10. doi: 10.1016/j.jbi.2013.12.006. Epub 2014 Jan 3.

PhenoDEF: a corpus for annotating sentences with information of phenotype definitions in biomedical literature.

J Biomed Semantics. 2022 Jun 11;13(1):17. doi: 10.1186/s13326-022-00272-6.

HunFlair2 in a cross-corpus evaluation of biomedical named entity recognition and normalization tools.

Bioinformatics. 2024 Oct 1;40(10). doi: 10.1093/bioinformatics/btae564.

Cited by

Parallel sequence tagging for concept recognition.

BMC Bioinformatics. 2022 Mar 24;22(Suppl 1):623. doi: 10.1186/s12859-021-04511-y.

GeneCup: mining PubMed and GWAS catalog for gene-keyword relationships.

G3 (Bethesda). 2022 May 6;12(5). doi: 10.1093/g3journal/jkac059.

Examining linguistic shifts between preprints and publications.

PLoS Biol. 2022 Feb 1;20(2):e3001470. doi: 10.1371/journal.pbio.3001470. eCollection 2022 Feb.

Do medicine and cell biology talk to each other? A study of vocabulary similarities between fields.

Braz J Med Biol Res. 2021 Oct 18;54(12):e11728. doi: 10.1590/1414-431X2021e11728. eCollection 2021.

ECO-CollecTF: A Corpus of Annotated Evidence-Based Assertions in Biomedical Manuscripts.

Front Res Metr Anal. 2021 Jul 13;6:674205. doi: 10.3389/frma.2021.674205. eCollection 2021.

Dependency parsing of biomedical text with BERT.

BMC Bioinformatics. 2020 Dec 29;21(Suppl 23):580. doi: 10.1186/s12859-020-03905-8.

Gold-standard ontology-based anatomical annotation in the CRAFT Corpus.

Database (Oxford). 2017 Jan 1;2017. doi: 10.1093/database/bax087.

ProtFus: A Comprehensive Method Characterizing Protein-Protein Interactions of Fusion Proteins.

PLoS Comput Biol. 2019 Aug 22;15(8):e1007239. doi: 10.1371/journal.pcbi.1007239. eCollection 2019 Aug.

Doc2Hpo: a web application for efficient and accurate HPO concept curation.

Nucleic Acids Res. 2019 Jul 2;47(W1):W566-W570. doi: 10.1093/nar/gkz386.

OsteoporosAtlas: a human osteoporosis-related gene database.

PeerJ. 2019 Apr 26;7:e6778. doi: 10.7717/peerj.6778. eCollection 2019.
