Valdez Joshua, Rueschman Michael, Kim Matthew, Redline Susan, Sahoo Satya S
Division of Medical Informatics and Electrical Engineering and Computer Science Department, Case Western Reserve University, Cleveland, OH, USA.
Departments of Medicine, Brigham and Women's Hospital and Beth Israel Deaconess Medical Center, Harvard University, Boston, MA, USA.
On Move Meaningful Internet Syst. 2016 Oct;10033:699-708. doi: 10.1007/978-3-319-48472-3_43. Epub 2016 Oct 18.
Extracting structured information from biomedical literature is challenging because of the complexity of the biomedical domain and the lack of appropriate natural language processing (NLP) techniques. High-quality domain ontologies model both data and metadata at a fine level of granularity, and they can be used to extract structured information from biomedical text accurately. Extracting provenance metadata, which describes the history or source of information, from published articles is an important task in supporting scientific reproducibility. Reproducibility of the results reported by previous research studies is a foundational component of scientific advancement, as highlighted by the recent US National Institutes of Health initiative "Principles of Rigor and Reproducibility". In this paper, we describe an effective approach to extracting provenance metadata from published biomedical research literature using an ontology-enabled NLP platform developed as part of the Provenance for Clinical and Healthcare Research (ProvCaRe) project. The ProvCaRe-NLP tool extends the clinical Text Analysis and Knowledge Extraction System (cTAKES) platform using both provenance and biomedical domain ontologies. We demonstrate the effectiveness of the ProvCaRe-NLP tool using a corpus of 20 peer-reviewed publications. The results of our evaluation show that the ProvCaRe-NLP tool has significantly higher recall in extracting provenance metadata than existing NLP pipelines such as MetaMap.
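The evaluation above is reported in terms of recall, that is, the fraction of gold-standard provenance terms that a pipeline actually recovers. The following is a minimal sketch of that idea, assuming a toy dictionary lookup in place of an ontology-enabled pipeline. It is not the ProvCaRe-NLP implementation and does not use cTAKES or MetaMap; the term list, class labels, example sentence, and gold annotations are all hypothetical.

```python
import re

# Hypothetical mini "ontology": surface term -> provenance class label.
# These entries are illustrative only and are not taken from ProvCaRe.
ONTOLOGY_TERMS = {
    "polysomnography": "Instrument",
    "randomized controlled trial": "StudyDesign",
    "body mass index": "Variable",
}


def extract_terms(text, ontology):
    """Return the set of ontology terms found in text (case-insensitive, whole-word)."""
    found = set()
    lowered = text.lower()
    for term in ontology:
        if re.search(r"\b" + re.escape(term) + r"\b", lowered):
            found.add(term)
    return found


def recall(extracted, gold):
    """Recall = correctly extracted gold items / all gold items."""
    if not gold:
        return 0.0
    return len(extracted & gold) / len(gold)


# Hypothetical sentence and hand-made gold annotations for illustration.
sentence = (
    "Participants in the randomized controlled trial underwent "
    "polysomnography, and sleep duration was recorded."
)
gold = {"randomized controlled trial", "polysomnography", "sleep duration"}

extracted = extract_terms(sentence, ONTOLOGY_TERMS)
print(sorted(extracted))
print(round(recall(extracted, gold), 2))
```

In this sketch the term "sleep duration" is missing from the dictionary, so it is not extracted and recall falls below 1. This mirrors, in a very simplified way, why broader ontology coverage would be expected to raise recall.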