CALBC silver standard corpus.

Authors

Rebholz-Schuhmann Dietrich, Jimeno Yepes Antonio José, Van Mulligen Erik M, Kang Ning, Kors Jan, Milward David, Corbett Peter, Buyko Ekaterina, Beisswanger Elena, Hahn Udo

Affiliation

EMBL Outstation-Hinxton, European Bioinformatics Institute, Hinxton, Cambridge CB10 1SD, UK.

Publication

J Bioinform Comput Biol. 2010 Feb;8(1):163-79. doi: 10.1142/s0219720010004562.

Abstract

The CALBC initiative aims to provide a large-scale biomedical text corpus that contains semantic annotations for named entities of different kinds. Generating this corpus requires that the annotations from different automatic annotation systems be harmonized. In the first phase, the annotation systems from five participants (EMBL-EBI, EMC Rotterdam, NLM, JULIE Lab Jena, and Linguamatics) were gathered. All annotations were delivered in a common annotation format that included concept identifiers in the boundary assignments and that enabled comparison and alignment of the results. During the harmonization phase, the results produced by those different systems were integrated into a single harmonized corpus ("silver standard" corpus) by applying a voting scheme. We give an overview of the processed data and the principles of harmonization: formal boundary reconciliation and semantic matching of named entities. Finally, all submissions of the participants were evaluated against that silver standard corpus. We found that species and disease annotations are better standardized amongst the partners than the annotations of genes and proteins. The raw corpus is now available for additional named entity annotations. Parts of it will be made available later on for a public challenge. We expect that we can improve corpus building activities both in the number of named entity classes covered and in the size of the corpus in terms of annotated documents.
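The voting-based harmonization and the subsequent evaluation against the silver standard can be illustrated with a minimal sketch. It rests on simplifying assumptions that are not taken from the abstract: annotations count as agreeing only when their character offsets and concept identifier match exactly, and an annotation enters the silver standard once a hypothetical threshold `min_votes` of systems propose it. The names `Annotation`, `harmonize`, and `precision_recall` are illustrative, and the boundary reconciliation and semantic matching mentioned in the abstract are not modeled here.

```python
from collections import Counter
from typing import NamedTuple


class Annotation(NamedTuple):
    start: int       # character offset where the entity mention begins
    end: int         # character offset where the mention ends (exclusive)
    concept_id: str  # hypothetical concept identifier attached to the span


def harmonize(system_outputs: dict[str, set[Annotation]],
              min_votes: int = 2) -> set[Annotation]:
    """Keep annotations proposed by at least `min_votes` systems.

    Assumption: agreement means an exact match on offsets and concept ID.
    """
    votes: Counter = Counter()
    for annotations in system_outputs.values():
        # Each system's output is a set, so it votes once per annotation.
        votes.update(set(annotations))
    return {ann for ann, count in votes.items() if count >= min_votes}


def precision_recall(submitted: set[Annotation],
                     silver: set[Annotation]) -> tuple[float, float]:
    """Score one system's submission against the silver standard."""
    true_positives = len(submitted & silver)
    precision = true_positives / len(submitted) if submitted else 0.0
    recall = true_positives / len(silver) if silver else 0.0
    return precision, recall


if __name__ == "__main__":
    # Toy data: three hypothetical systems annotating the same document.
    a = Annotation(0, 5, "C1")
    b = Annotation(10, 15, "C2")
    c = Annotation(20, 25, "C3")
    outputs = {"system_1": {a, b}, "system_2": {a, c}, "system_3": {a, b}}

    silver = harmonize(outputs, min_votes=2)
    for name, submitted in outputs.items():
        print(name, precision_recall(submitted, silver))
```

Raising `min_votes` in this sketch trades recall for precision in the resulting silver standard; a real harmonization would also need to decide how to treat overlapping spans with differing boundaries.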
