Oh Sehyun, Gravel-Pucillo Kai, Ramos Marcel, Davis Sean, Carey Vince, Morgan Martin, Waldron Levi
City University of New York School of Public Health.
University of Colorado Anschutz School of Medicine.
Res Sq. 2024 May 15:rs.3.rs-4370115. doi: 10.21203/rs.3.rs-4370115/v1.
Advancements in sequencing technologies and the development of new data collection methods produce large volumes of biological data. The Genomic Data Science Analysis, Visualization, and Informatics Lab-space (AnVIL) provides a cloud-based platform for democratizing access to large-scale genomics data and analysis tools. However, utilizing the full capabilities of AnVIL can be challenging for researchers without extensive bioinformatics expertise, especially for executing complex workflows. Here we present the AnVILWorkflow R package, which enables the convenient execution of bioinformatics workflows hosted on AnVIL directly from an R environment. AnVILWorkflow simplifies the setup of the cloud computing environment, input data formatting, workflow submission, and retrieval of results through intuitive functions. We demonstrate the utility of AnVILWorkflow for three use cases: bulk RNA-seq analysis with Salmon, metagenomics analysis with bioBakery, and digital pathology image processing with PathML. The key features of AnVILWorkflow include user-friendly browsing of available data and workflows, seamless integration of R and non-R tools within a reproducible analysis pipeline, and access to scalable computing resources without direct management overhead. While some limitations exist around workflow customization, AnVILWorkflow lowers the barrier to taking advantage of AnVIL's resources, especially for exploratory analyses or bulk processing with established workflows. This empowers a broader community of researchers to leverage the latest genomics tools and datasets using familiar R syntax. The package is distributed through the Bioconductor project (https://bioconductor.org/packages/AnVILWorkflow), and the source code is available on GitHub (https://github.com/shbrief/AnVILWorkflow).