Institute of Clinical Molecular Biology, Christian-Albrechts-University of Kiel, Rosalind-Franklin-Str. 12, 24105 Kiel, Germany.
Haematology Lab Kiel, Klinik für Innere Medizin II, University Hospital Schleswig-Holstein, Langer Segen 8-10, 24105 Kiel, Germany.
Gigascience. 2021 Jun 29;10(6). doi: 10.1093/gigascience/giab047.
Genome-wide association studies (GWAS) and phenome-wide association studies (PheWAS) involving 1 million GWAS samples from dozens of population-based biobanks present a considerable computational challenge and are carried out by large scientific groups at great cost in time and personnel. Automating these processes requires highly efficient and scalable methods and software, but so far no workflow solution exists that can easily process 1 million GWAS samples.
Here we present BIGwas, a portable, fully automated quality control (QC) and association testing pipeline for large-scale binary and quantitative trait GWAS data provided by biobank resources. By using Nextflow workflow and Singularity software container technology, BIGwas performs resource-efficient and reproducible analyses on a local computer or any high-performance computing (HPC) system with a single command, without the need to manually install a software execution environment or individual software packages. For a single-command GWAS analysis with 974,818 individuals and 92 million genetic markers, BIGwas takes ∼16 days on a small HPC system with only 7 compute nodes to perform a complete GWAS QC and association analysis protocol. Our dynamic parallelization approach enables shorter runtimes on larger HPC systems.
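To give a sense of the scale involved, the following minimal Python sketch computes the number of genotype calls implied by the reported data set and shows how splitting markers into independent chunks lets a larger cluster finish in fewer scheduling rounds. It illustrates the general idea of chunk-based parallelization only and is not BIGwas's actual scheduling logic; the chunk size of 1 million markers and the helper names `split_markers` and `rounds_needed` are assumptions made for illustration.

```python
import math

# Figures reported in the abstract.
N_INDIVIDUALS = 974_818
N_MARKERS = 92_000_000
N_NODES = 7

# Hypothetical chunk size, chosen only for illustration.
CHUNK_SIZE = 1_000_000


def genotype_count(n_individuals: int, n_markers: int) -> int:
    """Total number of individual-by-marker genotype calls."""
    return n_individuals * n_markers


def split_markers(n_markers: int, chunk_size: int) -> list[tuple[int, int]]:
    """Return half-open (start, end) index ranges that cover all markers."""
    return [
        (start, min(start + chunk_size, n_markers))
        for start in range(0, n_markers, chunk_size)
    ]


def rounds_needed(n_chunks: int, n_nodes: int) -> int:
    """Scheduling rounds if each node processes one chunk per round."""
    return math.ceil(n_chunks / n_nodes)


chunks = split_markers(N_MARKERS, CHUNK_SIZE)
total = genotype_count(N_INDIVIDUALS, N_MARKERS)

print(f"genotype calls: {total:.3e}")
print(f"chunks: {len(chunks)}")
print(f"rounds on {N_NODES} nodes: {rounds_needed(len(chunks), N_NODES)}")
print(f"rounds on 92 nodes: {rounds_needed(len(chunks), 92)}")
```

Under these assumptions, the 92 chunks need 14 rounds on 7 nodes but only 1 round on 92 nodes, which is the intuition behind shorter runtimes on larger systems. Real runtimes also depend on per-chunk cost, I/O, and QC steps that cannot be split this way.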
Researchers without extensive bioinformatics knowledge and with limited computing resources can use BIGwas to perform multi-cohort GWAS with 1 million GWAS samples and, if desired, to build their own (genome-wide) PheWAS resource. BIGwas is freely available for download from http://github.com/ikmb/gwas-qc and http://github.com/ikmb/gwas-assoc.