Sahoo Satya S, Valdez Joshua, Rueschman Michael
Division of Medical Informatics, School of Medicine, Case Western Reserve University, Cleveland, OH.
Department of Medicine, Brigham and Women's Hospital and Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, MA.
AMIA Annu Symp Proc. 2017 Feb 10;2016:1070-1079. eCollection 2016.
Scientific reproducibility is key to scientific progress: it allows the research community to build on validated results, protects patients from potentially harmful trial drugs derived from incorrect results, and reduces the waste of valuable resources. The National Institutes of Health (NIH) recently published a systematic guideline titled "Rigor and Reproducibility" for supporting reproducible research studies, which has also been accepted by several scientific journals. These journals will require published articles to conform to the new guidelines. Provenance metadata describes the history or origin of data, and it has long been used in computer science to ensure data quality and support scientific reproducibility. In this paper, we describe the development of the Provenance for Clinical and healthcare Research (ProvCaRe) framework, together with a provenance ontology, to support scientific reproducibility by formally modeling a core set of data elements representing the details of a research study. We extend the PROV Ontology (PROV-O), which has been recommended as the provenance representation model by the World Wide Web Consortium (W3C), to represent both (a) data provenance and (b) process provenance. We use 124 study variables from 6 clinical research studies from the National Sleep Research Resource (NSRR) to evaluate the coverage of the provenance ontology. NSRR is the largest repository of NIH-funded sleep datasets, with 50,000 studies from 36,000 participants. The provenance ontology reuses concepts from existing biomedical ontologies, for example the Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT), to model the provenance information of research studies. The ProvCaRe framework is being developed as part of the Big Data to Knowledge (BD2K) data provenance project.
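To make the distinction between data provenance and process provenance concrete, the following is a minimal sketch of how one study variable might be described with core PROV-O terms. The PROV-O class and property names (`prov:Entity`, `prov:Activity`, `prov:Agent`, `prov:wasDerivedFrom`, `prov:wasGeneratedBy`, `prov:used`, `prov:wasAssociatedWith`) come from the W3C recommendation. Everything in the `EX` namespace, including the resource names and the helper functions, is a hypothetical placeholder and not a term from the ProvCaRe ontology.

```python
# Sketch only: PROV-O style triples for one hypothetical study variable.
# All "EX" resources are illustrative assumptions, not ProvCaRe ontology terms.

PROV = "http://www.w3.org/ns/prov#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "http://example.org/provenance-sketch#"  # hypothetical namespace


def build_provenance() -> list:
    """Return (subject, predicate, object) triples for one study variable."""
    return [
        # Data provenance: what the data items are and how they relate.
        (EX + "variable_value", RDF_TYPE, PROV + "Entity"),
        (EX + "raw_recording", RDF_TYPE, PROV + "Entity"),
        (EX + "variable_value", PROV + "wasDerivedFrom", EX + "raw_recording"),
        # Process provenance: which step produced the value, and who ran it.
        (EX + "scoring_step", RDF_TYPE, PROV + "Activity"),
        (EX + "technician", RDF_TYPE, PROV + "Agent"),
        (EX + "scoring_step", PROV + "used", EX + "raw_recording"),
        (EX + "variable_value", PROV + "wasGeneratedBy", EX + "scoring_step"),
        (EX + "scoring_step", PROV + "wasAssociatedWith", EX + "technician"),
    ]


def to_ntriples(triples) -> str:
    """Serialize IRI-only triples, one N-Triples statement per line."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


if __name__ == "__main__":
    print(to_ntriples(build_provenance()))
```

In this sketch the `wasDerivedFrom` statement carries the data provenance, while the `used`, `wasGeneratedBy`, and `wasAssociatedWith` statements carry the process provenance. A domain ontology that extends PROV-O would typically introduce subclasses of these three core classes rather than new top-level terms.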