Zhou Yong, Kathiresan Nagarajan, Yu Zhichao, Rivera Luis F, Yang Yujian, Thimma Manjula, Manickam Keerthana, Chebotarov Dmytro, Mauleon Ramil, Chougule Kapeel, Wei Sharon, Gao Tingting, Green Carl D, Zuccolo Andrea, Xie Weibo, Ware Doreen, Zhang Jianwei, McNally Kenneth L, Wing Rod A
Center for Desert Agriculture (CDA), Biological and Environmental Sciences & Engineering Division (BESE), King Abdullah University of Science and Technology (KAUST), Thuwal, 23955-6900, Saudi Arabia.
Arizona Genomics Institute (AGI), School of Plant Sciences, University of Arizona, Tucson, AZ, 85721, USA.
BMC Biol. 2024 Jan 25;22(1):13. doi: 10.1186/s12915-024-01820-5.
Single-nucleotide polymorphisms (SNPs) are the form of molecular genetic variation most widely used in genetic studies. As reference genomes and resequencing data sets expand exponentially, tools must be in place to call SNPs at a similar pace. The Genome Analysis Toolkit (GATK) is one of the most widely used publicly available SNP calling tools, but high-performance computing versions of it have yet to become widely available and affordable.
Here we report an open-source high-performance computing genome variant calling workflow (HPC-GVCW) for GATK that can run on multiple computing platforms, from supercomputers to desktop machines. We benchmarked HPC-GVCW on multiple crop species for performance and accuracy and obtained results comparable to previously published reports (using GATK alone). Finally, we used HPC-GVCW in production mode to call SNPs on a "subpopulation aware" 16-genome rice reference panel with ~3000 resequenced rice accessions. The entire process took ~16 weeks and identified an average of 27.3 million SNPs per genome, including ~2.3 million novel SNPs that were not present in the flagship reference genome for rice (i.e., IRGSP RefSeq).
This study developed an open-source pipeline (HPC-GVCW) to run GATK on HPC platforms, significantly improving the speed at which SNPs can be called. The workflow is widely applicable, as demonstrated for four major crop species with genomes ranging in size from 400 Mb to 2.4 Gb. Using HPC-GVCW in production mode to call SNPs on a multi-crop data set of 25 reference genomes produced over 1.1 billion SNPs, which were publicly released for functional and breeding studies. For rice, many novel SNPs were identified within genes and open chromatin regions that are predicted to have functional consequences. Combined, our results demonstrate the value of pairing a high-performance SNP calling architecture with a subpopulation-aware reference genome panel for rapid SNP discovery and public deployment.