Jun Goo, Wing Mary Kate, Abecasis Gonçalo R, Kang Hyun Min
Human Genetics Center, School of Public Health, The University of Texas Health Science Center at Houston, Houston, Texas 77030, USA; Center for Statistical Genetics and Department of Biostatistics, University of Michigan School of Public Health, Ann Arbor, Michigan 48109, USA.
Center for Statistical Genetics and Department of Biostatistics, University of Michigan School of Public Health, Ann Arbor, Michigan 48109, USA.
Genome Res. 2015 Jun;25(6):918-25. doi: 10.1101/gr.176552.114. Epub 2015 Apr 16.
The analysis of next-generation sequencing data is computationally and statistically challenging because of the massive volume of data and imperfect data quality. We present GotCloud, a pipeline for efficiently detecting and genotyping high-quality variants from large-scale sequencing data. GotCloud automates sequence alignment, sample-level quality control, variant calling, filtering of likely artifacts using machine-learning techniques, and genotype refinement using haplotype information. The pipeline can process thousands of samples in parallel and requires fewer computational resources than current alternatives. Experiments with whole-genome and exome-targeted sequence data generated by the 1000 Genomes Project show that the pipeline provides effective filtering against false-positive variants and high power to detect true variants. Our pipeline has already contributed to variant detection and genotyping in several large-scale sequencing projects, including the 1000 Genomes Project and the NHLBI Exome Sequencing Project. We hope it will now prove useful to many medical sequencing studies.