Coreference annotation and resolution in the Colorado Richly Annotated Full Text (CRAFT) corpus of biomedical journal articles.

Authors

Cohen K Bretonnel, Lanfranchi Arrick, Choi Miji Joo-Young, Bada Michael, Baumgartner William A, Panteleyeva Natalya, Verspoor Karin, Palmer Martha, Hunter Lawrence E

Affiliations

Computational Bioscience Program, University of Colorado School of Medicine, Denver, CO, USA.

Department of Linguistics, University of Colorado at Boulder, Boulder, Colorado, USA.

Publication information

BMC Bioinformatics. 2017 Aug 17;18(1):372. doi: 10.1186/s12859-017-1775-9.

Abstract

BACKGROUND

Coreference resolution is the task of finding strings in text that have the same referent as other strings. Failures of coreference resolution are a common cause of false negatives in information extraction from the scientific literature. In order to better understand the nature of the phenomenon of coreference in biomedical publications and to increase performance on the task, we annotated the Colorado Richly Annotated Full Text (CRAFT) corpus with coreference relations.
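
As a rough illustration of why unresolved coreference leads to false negatives, the following Python sketch (not taken from the paper) copies a semantic label from one mention to the other mentions in the same chain. The example sentence, the character offsets, the `GENE` label, and the `propagate_labels` helper are all hypothetical.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def propagate_labels(chains: List[List[Span]],
                     labels: Dict[Span, str]) -> Dict[Span, str]:
    """Copy a semantic label from a labelled mention to the rest of its chain."""
    result = dict(labels)
    for chain in chains:
        # Collect the distinct labels already present in this chain.
        found = {labels[m] for m in chain if m in labels}
        if len(found) == 1:  # propagate only when the chain is unambiguous
            label = next(iter(found))
            for mention in chain:
                result.setdefault(mention, label)
    return result


# Hypothetical input: an entity tagger labelled "Sox9" but missed the pronoun.
text = "Sox9 is required for chondrogenesis. It is expressed early."
chains = [[(0, 4), (37, 39)]]  # "Sox9" and "It" share a referent
labels = {(0, 4): "GENE"}

resolved = propagate_labels(chains, labels)
```

Without the chain, an extraction system would see no gene in the second sentence; with it, the pronoun at offsets 37 to 39 inherits the label of its antecedent.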

RESULTS

The corpus was manually annotated with coreference relations, including identity and appositives for all coreferring base noun phrases. The OntoNotes annotation guidelines, with minor adaptations, were used. Interannotator agreement ranges from 0.480 (entity-based CEAF) to 0.858 (Class-B3), depending on the metric that is used to assess it. The resulting corpus adds nearly 30,000 annotations to the previous release of the CRAFT corpus. Differences from related projects include a much broader definition of markables, connection to extensive annotation of several domain-relevant semantic classes, and connection to complete syntactic annotation. Tool performance was benchmarked on the data. A publicly available out-of-the-box, general-domain coreference resolution system achieved an F-measure of 0.14 (B3), while a simple domain-adapted rule-based system achieved an F-measure of 0.42. An ensemble of the two reached an F-measure of 0.46. Following the IDENTITY chains in the data would add 106,263 additional named entities in the full 97-paper corpus, an increase of 76% in the semantic classes of the eight ontologies that have been annotated in earlier versions of the CRAFT corpus.
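
For readers unfamiliar with the B3 metric cited above, here is a minimal Python sketch of the textbook mention-level B3 precision, recall, and F-measure. It is an assumption-laden illustration, not the scorer used in the paper: the Class-B3 variant reported for interannotator agreement may differ, the function names are hypothetical, and a mention missing from one side is simply scored as having no overlap.

```python
from typing import Dict, FrozenSet, Hashable, List, Set, Tuple

Chains = List[Set[Hashable]]


def _cluster_of(chains: Chains) -> Dict[Hashable, FrozenSet[Hashable]]:
    """Map every mention to the chain (cluster) that contains it."""
    return {m: frozenset(chain) for chain in chains for m in chain}


def _mean_overlap(src: Dict[Hashable, FrozenSet[Hashable]],
                  other: Dict[Hashable, FrozenSet[Hashable]]) -> float:
    """Average, over mentions in src, of |src chain & other chain| / |src chain|."""
    if not src:
        return 0.0
    total = 0.0
    for mention, cluster in src.items():
        # Assumption: a mention absent from the other side has zero overlap.
        other_cluster = other.get(mention, frozenset())
        total += len(cluster & other_cluster) / len(cluster)
    return total / len(src)


def b_cubed(key: Chains, response: Chains) -> Tuple[float, float, float]:
    """Return (precision, recall, F-measure) for response chains against key chains."""
    key_of = _cluster_of(key)
    resp_of = _cluster_of(response)
    precision = _mean_overlap(resp_of, key_of)
    recall = _mean_overlap(key_of, resp_of)
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


# Hypothetical toy data: mentions are identified by short strings.
key = [{"a", "b", "c"}, {"d", "e"}]
response = [{"a", "b"}, {"c", "d", "e"}]
precision, recall, f_measure = b_cubed(key, response)
```

Because B3 averages per-mention scores, a system that splits or merges long chains is penalized in proportion to the number of mentions affected.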

CONCLUSIONS

The project produced a large data set for further investigation of coreference and coreference resolution in the scientific literature. The work raised issues concerning the phenomenon of reference in this domain and genre, and the paper proposes that many mentions that would be considered generic in the general domain are not generic in the biomedical domain, because they refer to specific classes in domain-specific ontologies. The comparison of the performance of a publicly available and well-understood coreference resolution system with a domain-adapted system produced results that are consistent with the notion that the requirements for successful coreference resolution in this genre are quite different from those of the general domain, and that also suggest that the baseline performance difference is quite large.
