Constructing a semantic predication gold standard from the biomedical literature.

Affiliation

Lister Hill National Center for Biomedical Communications, National Library of Medicine, Bethesda, MD, USA.

Publication information

BMC Bioinformatics. 2011 Dec 20;12:486. doi: 10.1186/1471-2105-12-486.

Abstract

BACKGROUND

Semantic relations increasingly underpin biomedical text mining and knowledge discovery applications. The success of such practical applications crucially depends on the quality of extracted relations, which can be assessed against a gold standard reference. Most such references in biomedical text mining focus on narrow subdomains and adopt different semantic representations, rendering them difficult to use for benchmarking independently developed relation extraction systems. In this article, we present a multi-phase gold standard annotation study, in which we annotated 500 sentences randomly selected from MEDLINE abstracts on a wide range of biomedical topics with 1371 semantic predications. The UMLS Metathesaurus served as the main source for conceptual information and the UMLS Semantic Network for relational information. We measured interannotator agreement and analyzed the annotations closely to identify some of the challenges in annotating biomedical text with relations based on an ontology or a terminology.

RESULTS

We obtain fair to moderate interannotator agreement in the practice phase (0.378-0.475). With improved guidelines and additional semantic equivalence criteria, agreement increases by about 12 percentage points (from 0.415 to 0.536) in the main annotation phase. In addition, we find that agreement increases to 0.688 when the agreement calculation is limited to those predications that are based only on the explicitly provided UMLS concepts and relations.
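As a rough illustration of how an agreement figure of this kind can be computed, the sketch below scores two annotators' predication sets with an F1-style overlap measure. The abstract does not name the agreement metric, so the measure itself, the function name `pairwise_agreement`, the triple representation, and the example predications are all assumptions made for illustration, not the study's actual procedure or data. The last lines work through the reported change, reading 0.415 and 0.536 as before and after values.

```python
from typing import Set, Tuple

# A predication is modelled here as (subject concept, predicate, object concept).
# This representation is an assumption made for the sketch.
Predication = Tuple[str, str, str]


def pairwise_agreement(a: Set[Predication], b: Set[Predication]) -> float:
    """F1-style agreement: 2 * |shared| / (|a| + |b|), using exact triple match."""
    if not a and not b:
        return 1.0  # two empty annotations agree trivially
    shared = len(a & b)
    return 2 * shared / (len(a) + len(b))


# Hypothetical annotations of the same sentences by two annotators.
annotator_1 = {
    ("Aspirin", "TREATS", "Headache"),
    ("Aspirin", "INHIBITS", "Cyclooxygenase"),
    ("Smoking", "CAUSES", "Lung cancer"),
}
annotator_2 = {
    ("Aspirin", "TREATS", "Headache"),
    ("Smoking", "PREDISPOSES", "Lung cancer"),
}

score = pairwise_agreement(annotator_1, annotator_2)
print(f"agreement = {score:.3f}")

# Reported change, assuming 0.415 is the earlier and 0.536 the later value.
absolute_gain = 0.536 - 0.415            # difference in agreement score
relative_gain = absolute_gain / 0.415    # gain relative to the earlier value
print(f"absolute gain = {absolute_gain:.3f}, relative gain = {relative_gain:.1%}")
```

Exact triple matching is the strictest choice. Semantic equivalence criteria like those mentioned above would correspond to relaxing the match, for example by treating closely related concepts or predicates as equal before intersecting the sets.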

CONCLUSIONS

While interannotator agreement in the practice phase confirms that conceptual annotation is a challenging task, the increased agreement in the main annotation phase indicates that an acceptable level of agreement can be achieved over multiple iterations by setting stricter guidelines and establishing semantic equivalence criteria. Mapping text to ontological concepts emerges as the main challenge in conceptual annotation. Annotating predications involving biomolecular entities and processes is particularly challenging. While the resulting gold standard is mainly intended to serve as a test collection for our semantic interpreter, we believe that the lessons learned are generally applicable.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9574/3281188/d4f091c93e23/1471-2105-12-486-1.jpg
