Lister Hill National Center for Biomedical Communications, National Library of Medicine, 8600 Rockville Pike, Bethesda, 20894, MD, USA.
University of Illinois at Urbana-Champaign, School of Information Sciences, 501 E Daniel Street, Champaign, 61820, IL, USA.
BMC Bioinformatics. 2020 May 14;21(1):188. doi: 10.1186/s12859-020-3517-7.
In the era of information overload, natural language processing (NLP) techniques are increasingly needed to support advanced biomedical information management and discovery applications. In this paper, we present an in-depth description of SemRep, an NLP system that extracts semantic relations from PubMed abstracts using linguistic principles and UMLS domain knowledge. We also evaluate SemRep on two datasets. In one evaluation, we use a manually annotated test collection and perform a comprehensive error analysis. In another evaluation, we assess SemRep's performance on the CDR dataset, a standard benchmark corpus annotated with causal chemical-disease relationships.
A strict evaluation of SemRep on our manually annotated dataset yields 0.55 precision, 0.34 recall, and 0.42 F score. A relaxed evaluation, which more accurately characterizes SemRep performance, yields 0.69 precision, 0.42 recall, and 0.52 F score. An error analysis reveals named entity recognition/normalization as the largest source of errors (26.9%), followed by argument identification (14%) and trigger detection errors (12.5%). The evaluation on the CDR corpus yields 0.90 precision, 0.24 recall, and 0.38 F score. The recall and the F score increase to 0.35 and 0.50, respectively, when the evaluation on this corpus is limited to sentence-bound relationships, which represents a fairer evaluation, as SemRep operates at the sentence level.
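The reported F scores can be checked against the stated precision and recall. The sketch below assumes the F score is the balanced F1 measure (the harmonic mean of precision and recall), which the abstract does not state explicitly. The function name `f1_score` and the dictionary `reported` are illustrative only. The sentence-bound CDR result is left out because the abstract does not give its precision.

```python
def f1_score(precision: float, recall: float) -> float:
    """Balanced F score: harmonic mean of precision and recall (assumed F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F score), copied from the abstract.
reported = {
    "strict": (0.55, 0.34, 0.42),
    "relaxed": (0.69, 0.42, 0.52),
    "CDR (all relations)": (0.90, 0.24, 0.38),
}

for name, (p, r, f_reported) in reported.items():
    print(f"{name}: computed F = {f1_score(p, r):.2f}, reported F = {f_reported:.2f}")
```

Because the inputs are already rounded to two decimals, a recomputed value can in general differ from a reported one in the last digit. For these three rows the recomputed values agree with the reported ones at two decimals.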
SemRep is a broad-coverage, interpretable, strong baseline system for extracting semantic relations from biomedical text. It also underpins SemMedDB, a literature-scale knowledge graph based on semantic relations. Through SemMedDB, SemRep has had a significant impact in the scientific community, supporting a variety of clinical and translational applications, including clinical decision making, medical diagnosis, drug repurposing, and literature-based discovery and hypothesis generation, and contributing to improved health outcomes. In ongoing development, we are redesigning SemRep to increase its modularity and flexibility and addressing the weaknesses identified in the error analysis.