Cohen K Bretonnel, Lanfranchi Arrick, Choi Miji Joo-Young, Bada Michael, Baumgartner William A, Panteleyeva Natalya, Verspoor Karin, Palmer Martha, Hunter Lawrence E
Computational Bioscience Program, University of Colorado School of Medicine, Denver, CO, USA.
Department of Linguistics, University of Colorado at Boulder, Boulder, Colorado, USA.
BMC Bioinformatics. 2017 Aug 17;18(1):372. doi: 10.1186/s12859-017-1775-9.
Coreference resolution is the task of finding strings in text that have the same referent as other strings. Failures of coreference resolution are a common cause of false negatives in information extraction from the scientific literature. In order to better understand the nature of the phenomenon of coreference in biomedical publications and to increase performance on the task, we annotated the Colorado Richly Annotated Full Text (CRAFT) corpus with coreference relations.
The corpus was manually annotated with coreference relations, including identity and appositives for all coreferring base noun phrases. The OntoNotes annotation guidelines, with minor adaptations, were used. Interannotator agreement ranges from 0.480 (entity-based CEAF) to 0.858 (Class-B3), depending on the metric that is used to assess it. The resulting corpus adds nearly 30,000 annotations to the previous release of the CRAFT corpus. Differences from related projects include a much broader definition of markables, connection to extensive annotation of several domain-relevant semantic classes, and connection to complete syntactic annotation. Tool performance was benchmarked on the data. A publicly available out-of-the-box, general-domain coreference resolution system achieved an F-measure of 0.14 (B3), while a simple domain-adapted rule-based system achieved an F-measure of 0.42. An ensemble of the two reached an F-measure of 0.46. Following the IDENTITY chains in the data would add 106,263 named entities in the full 97-paper corpus, an increase of 76% in the semantic classes of the eight ontologies that have been annotated in earlier versions of the CRAFT corpus.
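For readers unfamiliar with the B3 metric cited above, the following is a minimal sketch of how a mention-level B3 (B-cubed) precision, recall, and F-measure can be computed from gold and system coreference chains. It is an illustration only, not the scorer used in the paper: the function name `b_cubed`, the representation of chains as sets of mention identifiers, and the treatment of a mention missing from one side as a singleton are all assumptions made here.

```python
from typing import Hashable, Iterable, Set, Tuple


def b_cubed(
    gold_chains: Iterable[Set[Hashable]],
    system_chains: Iterable[Set[Hashable]],
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) under a simple B-cubed formulation."""
    # Map every mention to the chain that contains it.
    gold_of = {m: frozenset(chain) for chain in gold_chains for m in chain}
    sys_of = {m: frozenset(chain) for chain in system_chains for m in chain}

    mentions = set(gold_of) | set(sys_of)
    if not mentions:
        return 0.0, 0.0, 0.0

    precision_sum = 0.0
    recall_sum = 0.0
    for m in mentions:
        # Assumption: a mention absent from one side counts as a singleton there.
        gold = gold_of.get(m, frozenset([m]))
        system = sys_of.get(m, frozenset([m]))
        overlap = len(gold & system)
        precision_sum += overlap / len(system)
        recall_sum += overlap / len(gold)

    precision = precision_sum / len(mentions)
    recall = recall_sum / len(mentions)
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical toy chains; the letters stand for mention identifiers.
    gold = [{"a", "b", "c"}, {"d", "e"}]
    system = [{"a", "b"}, {"c", "d", "e"}]
    p, r, f = b_cubed(gold, system)
    print(f"P={p:.3f} R={r:.3f} F={f:.3f}")
```

Because B3 averages per-mention scores, a system that splits or merges one long chain is penalized in proportion to the number of mentions affected. Published scorers differ in how they handle mentions that appear on only one side, so figures from this sketch are not directly comparable to the values reported above.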
The project produced a large data set for further investigation of coreference and coreference resolution in the scientific literature. The work raised issues about the phenomenon of reference in this domain and genre, and the paper proposes that many mentions that would be considered generic in the general domain are not generic in the biomedical domain, because they refer to specific classes in domain-specific ontologies. The comparison of a publicly available and well-understood coreference resolution system with a domain-adapted system produced results consistent with the notion that the requirements for successful coreference resolution in this genre are quite different from those of the general domain, and the results also suggest that the baseline performance difference is quite large.