Johnston H Richard, Chopra Pankaj, Wingo Thomas S, Patel Viren, Epstein Michael P, Mulle Jennifer G, Warren Stephen T, Zwick Michael E, Cutler David J
Department of Human Genetics, Emory University School of Medicine, Atlanta, GA 30322.
Department of Biostatistics and Bioinformatics, Emory University Rollins School of Public Health, Atlanta, GA 30322.
Proc Natl Acad Sci U S A. 2017 Mar 7;114(10):E1923-E1932. doi: 10.1073/pnas.1618065114. Epub 2017 Feb 21.
The analysis of human whole-genome sequencing data presents significant computational challenges. The sheer size of datasets places an enormous burden on computational, disk array, and network resources. Here, we present an integrated computational package, PEMapper/PECaller, that was designed specifically to minimize the burden on networks and disk arrays, create output files that are minimal in size, and run in a highly computationally efficient way, with the single goal of enabling whole-genome sequencing at scale. In addition to improved computational efficiency, we implement a statistical framework that allows for a base-by-base error model, enabling this package to perform as well as or better than the widely used Genome Analysis Toolkit (GATK) in all key measures of performance on human whole-genome sequences.