Department of Biomedical Informatics, University of Colorado School of Medicine, Aurora, CO, USA.
Section of Critical Care Medicine, Department of Pediatrics, University of Colorado School of Medicine, Aurora, CO, USA.
Sci Data. 2024 Jan 2;11(1):8. doi: 10.1038/s41597-023-02854-0.
Data sharing is necessary to maximize the actionable knowledge generated from research data, and data challenges can encourage secondary analyses of datasets. Data challenges in biomedicine often rely on advanced cloud-based computing infrastructure and expensive industry partnerships; examples include challenges that use Google Cloud virtual machines and the Sage Bionetworks DREAM Challenges platform. Such robust infrastructures can be financially prohibitive for investigators without substantial resources. Given the potential to advance scientific and clinical knowledge, and the NIH emphasis on data sharing and reuse, there is a need for inexpensive and computationally lightweight methods for sharing data and hosting data challenges. To fill that gap, we developed a workflow that allows for reproducible model training, testing, and evaluation, leveraging public GitHub repositories, open-source computational languages, and Docker technology. In addition, we conducted a data challenge using the infrastructure we developed. In this manuscript, we report on the infrastructure, workflow, and data challenge results. The infrastructure and workflow are likely to be useful for data challenges and education.