Concept annotation in the CRAFT corpus.

Affiliation

Department of Pharmacology, University of Colorado Anschutz Medical Campus, Aurora, CO, USA.

Publication information

BMC Bioinformatics. 2012 Jul 9;13:161. doi: 10.1186/1471-2105-13-161.

DOI:10.1186/1471-2105-13-161
PMID:22776079
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3476437/

Abstract

BACKGROUND

Manually annotated corpora are critical for the training and evaluation of automated methods to identify concepts in biomedical text.

RESULTS

This paper presents the concept annotations of the Colorado Richly Annotated Full-Text (CRAFT) Corpus, a collection of 97 full-length, open-access biomedical journal articles that have been annotated both semantically and syntactically to serve as a research resource for the biomedical natural-language-processing (NLP) community. CRAFT identifies all mentions of nearly all concepts from nine prominent biomedical ontologies and terminologies: the Cell Type Ontology, the Chemical Entities of Biological Interest ontology, the NCBI Taxonomy, the Protein Ontology, the Sequence Ontology, the entries of the Entrez Gene database, and the three subontologies of the Gene Ontology. The first public release includes the annotations for 67 of the 97 articles, reserving two sets of 15 articles for future text-mining competitions (after which these too will be released). Concept annotations were created based on a single set of guidelines, which has enabled us to achieve consistently high interannotator agreement.

CONCLUSIONS

As the initial 67-article release contains more than 560,000 tokens (and the full set more than 790,000 tokens), our corpus is among the largest gold-standard annotated biomedical corpora. Unlike most others, the journal articles that comprise the corpus are drawn from diverse biomedical disciplines and are marked up in their entirety. Additionally, with a concept-annotation count of nearly 100,000 in the 67-article subset (and more than 140,000 in the full collection), the scale of conceptual markup is also among the largest of comparable corpora. The concept annotations of the CRAFT Corpus have the potential to significantly advance biomedical text mining by providing a high-quality gold standard for NLP systems. The corpus, annotation guidelines, and other associated resources are freely available at http://bionlp-corpora.sourceforge.net/CRAFT/index.shtml.
