Denaxas Spiros, Direk Kenan, Gonzalez-Izquierdo Arturo, Pikoula Maria, Cakiroglu Aylin, Moore Jason, Hemingway Harry, Smeeth Liam
Institute of Health Informatics, University College London, 222 Euston Road, London, NW1 2DA UK.
Farr Institute of Health Informatics Research, 222 Euston Road, London, UK.
BioData Min. 2017 Sep 11;10:31. doi: 10.1186/s13040-017-0151-7. eCollection 2017.
The ability of external investigators to reproduce published scientific findings is critical for the evaluation and validation of biomedical research by the wider community. However, a substantial proportion of health research using electronic health records (EHR), that is, data collected and generated during clinical care, is potentially not reproducible, mainly because the implementation details of most data preprocessing, cleaning, phenotyping and analysis approaches are not systematically made available or shared. As the complexity, volume and variety of EHR data sources available for research steadily increase, it is critical to ensure that scientific findings from EHR data are reproducible and replicable by other researchers. Reporting guidelines such as RECORD and STROBE have set a solid foundation by recommending a series of items for researchers to include in their research outputs. However, researchers often lack the technical tools and methodological approaches to implement such recommendations in an efficient and sustainable manner.
In this paper, we review and propose a series of methods and tools used in adjacent scientific disciplines that can enhance the reproducibility of research using electronic health records and enable researchers to report analytical approaches transparently. Specifically, we discuss the adoption of scientific software engineering principles and best practices such as test-driven development, source code revision control systems, literate programming, and the standardization and re-use of common data management and analytical approaches.
The adoption of such approaches will enable scientists to systematically document and share EHR analytical workflows and increase the reproducibility of biomedical research using such complex data sources.