Valdez Joshua, Kim Matthew, Rueschman Michael, Socrates Vimig, Redline Susan, Sahoo Satya S
Division of Medical Informatics, School of Medicine, Case Western Reserve University, Cleveland, OH.
Department of Medicine, Brigham and Women's Hospital and Beth Israel Deaconess Medical Center, Harvard Medical School, Harvard University, Boston, MA.
AMIA Annu Symp Proc. 2018 Apr 16;2017:1705-1714. eCollection 2017.
Scientific reproducibility is critical for biomedical research: it enables us to advance science by building on previous results, helps ensure the success of increasingly expensive drug trials, and allows funding agencies to make informed decisions. However, there is a growing "crisis" of reproducibility, as evidenced by a recent survey by the journal Nature of more than 1500 researchers, which found that 70% of researchers were not able to replicate results from other research groups and more than 50% were not able to reproduce their own research results. In 2016, the National Institutes of Health (NIH) announced the "Rigor and Reproducibility" guidelines to support reproducibility in biomedical research. A key component of these guidelines is the recording and analysis of "provenance" information, which describes the origin or history of data and plays a central role in ensuring scientific reproducibility. As part of the NIH Big Data to Knowledge (BD2K)-funded data provenance project, we have developed a new informatics framework called Provenance for Clinical and Healthcare Research (ProvCaRe) to extract, model, and analyze provenance information from published literature describing research studies. Using sleep medicine research studies that have made their data available through the National Sleep Research Resource (NSRR), we have developed an automated pipeline that identifies and extracts provenance metadata from published literature and makes it available for analysis in the ProvCaRe knowledgebase. NSRR is the largest repository of sleep data from over 40,000 studies involving 36,000 participants, and we used 75 published articles describing 6 research studies to populate the ProvCaRe knowledgebase. We evaluated the ProvCaRe knowledgebase, which contains 28,474 "provenance triples", using hypothesis-driven queries to identify and rank research studies based on the provenance information extracted from published articles.
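The abstract does not specify the structure of a "provenance triple" or how the hypothesis-driven queries rank studies. As a rough illustration only, the sketch below assumes a triple is a (subject, predicate, object) statement tagged with the study it was extracted from, and that a query is a list of terms scored by how many of them a study's triples mention. The names `ProvTriple` and `rank_studies`, the study labels, and the sample triples are all hypothetical and are not taken from ProvCaRe.

```python
from typing import NamedTuple


class ProvTriple(NamedTuple):
    """Hypothetical provenance triple tagged with its source study."""
    study: str
    subject: str
    predicate: str
    obj: str


def rank_studies(triples, required_terms):
    """Rank studies by how many distinct query terms their triples mention.

    This scoring rule is an assumption made for illustration; it is not
    the ranking method used by the ProvCaRe knowledgebase.
    """
    terms = [term.lower() for term in required_terms]
    matched = {}
    for t in triples:
        text = f"{t.subject} {t.predicate} {t.obj}".lower()
        hits = matched.setdefault(t.study, set())
        hits.update(term for term in terms if term in text)
    scores = {study: len(hits) for study, hits in matched.items()}
    # Highest score first; ties broken alphabetically by study label.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Invented sample data, for illustration only.
triples = [
    ProvTriple("StudyA", "participants", "assessed with", "polysomnography"),
    ProvTriple("StudyA", "blood pressure", "measured by", "sphygmomanometer"),
    ProvTriple("StudyB", "participants", "assessed with", "actigraphy"),
]
ranking = rank_studies(triples, ["polysomnography", "blood pressure"])
```

Here `ranking` orders the studies by the number of query terms their extracted provenance covers, which is one simple way a hypothesis about required study characteristics could be turned into a ranking.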