Department of Population Health and Reproduction, University of California, Davis, 1 Shields Avenue, Davis, CA 95616, USA.
Department of Animal Science, University of California, Davis, 1 Shields Avenue, Davis, CA 95616, USA.
GigaScience. 2021 Jan 13;10(1). doi: 10.1093/gigascience/giaa140.
As the scale of biological data generation has increased, the bottleneck of research has shifted from data generation to analysis. Researchers commonly need to build computational workflows that include multiple analytic tools and require incremental development as experimental insights demand tool and parameter modifications. These workflows can produce hundreds to thousands of intermediate files and results that must be integrated for biological insight. Data-centric workflow systems that internally manage computational resources, software, and conditional execution of analysis steps are reshaping the landscape of biological data analysis and empowering researchers to conduct reproducible analyses at scale. Adoption of these tools can facilitate and expedite robust data analysis, but knowledge of these techniques is still lacking. Here, we provide a series of strategies for leveraging workflow systems with structured project, data, and resource management to streamline large-scale biological analysis. We present these practices in the context of high-throughput sequencing data analysis, but the principles are broadly applicable to biologists working beyond this field.