Coreference annotation and resolution in the Colorado Richly Annotated Full Text (CRAFT) corpus of biomedical journal articles.

Author information

Cohen K Bretonnel, Lanfranchi Arrick, Choi Miji Joo-Young, Bada Michael, Baumgartner William A, Panteleyeva Natalya, Verspoor Karin, Palmer Martha, Hunter Lawrence E

Affiliations

Computational Bioscience Program, University of Colorado School of Medicine, Denver, CO, USA.

Department of Linguistics, University of Colorado at Boulder, Boulder, Colorado, USA.

Publication information

BMC Bioinformatics. 2017 Aug 17;18(1):372. doi: 10.1186/s12859-017-1775-9.

DOI:10.1186/s12859-017-1775-9
PMID:28818042
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5561560/
Abstract

BACKGROUND

Coreference resolution is the task of finding strings in text that have the same referent as other strings. Failures of coreference resolution are a common cause of false negatives in information extraction from the scientific literature. In order to better understand the nature of the phenomenon of coreference in biomedical publications and to increase performance on the task, we annotated the Colorado Richly Annotated Full Text (CRAFT) corpus with coreference relations.
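To illustrate why unresolved coreference produces false negatives, here is a minimal sketch in which only one mention in a chain carries a semantic-class annotation and the remaining mentions inherit it. The function name `propagate_classes`, the `(offset, text)` mention tuples, and the `"gene"` label are hypothetical choices for this example. They are not the CRAFT data format or the authors' procedure.

```python
def propagate_classes(chains, annotated):
    """Copy a semantic class to unlabeled mentions in the same coreference chain.

    chains    -- list of chains; each chain is a list of hashable mention ids
    annotated -- dict mapping some mention ids to a class label
    Returns a dict holding only the newly labeled mentions.
    Chains with no label, or with conflicting labels, are skipped (an assumption).
    """
    added = {}
    for chain in chains:
        labels = {annotated[m] for m in chain if m in annotated}
        if len(labels) != 1:
            continue  # nothing to propagate, or the chain is ambiguous
        label = next(iter(labels))
        for mention in chain:
            if mention not in annotated:
                added[mention] = label
    return added


# Hypothetical example: mentions are (character offset, surface text) pairs.
chains = [
    [(0, "Brca1"), (42, "the gene"), (97, "it")],
    [(10, "mice"), (60, "they")],
]
annotated = {(0, "Brca1"): "gene"}
print(propagate_classes(chains, annotated))
```

Without the chain, an extraction system that matches only the string "Brca1" would miss any statement made about "the gene" or "it".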

RESULTS

The corpus was manually annotated with coreference relations, including identity and appositives for all coreferring base noun phrases. The OntoNotes annotation guidelines, with minor adaptations, were used. Interannotator agreement ranges from 0.480 (entity-based CEAF) to 0.858 (Class-B3), depending on the metric that is used to assess it. The resulting corpus adds nearly 30,000 annotations to the previous release of the CRAFT corpus. Differences from related projects include a much broader definition of markables, connection to extensive annotation of several domain-relevant semantic classes, and connection to complete syntactic annotation. Tool performance was benchmarked on the data. A publicly available out-of-the-box, general-domain coreference resolution system achieved an F-measure of 0.14 (B3), while a simple domain-adapted rule-based system achieved an F-measure of 0.42. An ensemble of the two reached F of 0.46. Following the IDENTITY chains in the data would add 106,263 additional named entities in the full 97-paper corpus, for an increase of 76% in the semantic classes of the eight ontologies that have been annotated in earlier versions of the CRAFT corpus.
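For readers unfamiliar with the B3 figures quoted above, the following is a minimal sketch of the basic B3 (B-cubed) score. It assumes that gold and system chains cover the same set of mentions and that each mention belongs to exactly one chain. The function name `b_cubed` and the toy chains are illustrative only. The variant named in the abstract (Class-B3) and the scorer used by the authors may treat unmatched mentions differently.

```python
def b_cubed(gold_chains, system_chains):
    """Return (precision, recall, F1) for the basic B3 coreference metric.

    Both arguments are lists of sets of mention ids. Only mentions present
    on both sides are scored (a simplifying assumption of this sketch).
    """
    gold_of = {m: chain for chain in gold_chains for m in chain}
    sys_of = {m: chain for chain in system_chains for m in chain}
    mentions = gold_of.keys() & sys_of.keys()
    if not mentions:
        return 0.0, 0.0, 0.0

    # Per-mention overlap between its gold chain and its system chain.
    precision = sum(
        len(gold_of[m] & sys_of[m]) / len(sys_of[m]) for m in mentions
    ) / len(mentions)
    recall = sum(
        len(gold_of[m] & sys_of[m]) / len(gold_of[m]) for m in mentions
    ) / len(mentions)

    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example: the system wrongly moves mention "c" into the second chain.
gold = [{"a", "b", "c"}, {"d", "e"}]
system = [{"a", "b"}, {"c", "d", "e"}]
p, r, f = b_cubed(gold, system)
print(round(p, 3), round(r, 3), round(f, 3))  # 0.733 0.733 0.733
```

Because B3 averages over mentions rather than over links, a single misplaced mention lowers the score of every mention that shares a chain with it in either partition.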

CONCLUSIONS

The project produced a large data set for further investigation of coreference and coreference resolution in the scientific literature. The work raised issues in the phenomenon of reference in this domain and genre, and the paper proposes that many mentions that would be considered generic in the general domain are not generic in the biomedical domain because they refer to specific classes in domain-specific ontologies. The comparison of the performance of a publicly available and well-understood coreference resolution system with a domain-adapted system produced results that are consistent with the notion that the requirements for successful coreference resolution in this genre are quite different from those of the general domain, and also suggest that the baseline performance difference is quite large.
