Berke Seth R, Kanchan Kanika, Marazita Mary L, Tobin Eric, Ruczinski Ingo
Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, Maryland, United States of America.
Division of Allergy and Clinical Immunology, Johns Hopkins School of Medicine, Baltimore, Maryland, United States of America.
PLoS Comput Biol. 2025 Jul 2;21(7):e1013215. doi: 10.1371/journal.pcbi.1013215. eCollection 2025 Jul.
As the biomedical data ecosystem increasingly embraces the findable, accessible, interoperable, and reusable (FAIR) data principles to publish multimodal datasets to the cloud, opportunities for cloud-based research continue to expand. Beyond the potential for accelerated and diverse biomedical discovery that comes from a harmonized data ecosystem, the cloud also enables a shift away from the standard practice of duplicating data to computational clusters or local computers for analysis. Despite these benefits, researcher migration to the cloud has lagged, in part because of insufficient educational resources for training biomedical scientists on cloud infrastructure. A conceptual gap exists especially around crafting custom analytic pipelines that require software not pre-installed on cloud analysis platforms. Here we present three fundamental concepts necessary for creating custom pipelines in the cloud. These overarching concepts are workflow- and cloud-provider-agnostic, so they can serve as a foundation for any computational analysis running on any dataset in any biomedical cloud platform. We illustrate these concepts using one of our own custom analyses, a study that uses the case-parent trio design to detect sex-specific genetic effects on orofacial cleft (OFC) risk, which we built in the biomedical cloud analysis platform CAVATICA.