Pirooznia Mehdi, Kramer Melissa, Parla Jennifer, Goes Fernando S, Potash James B, McCombie W Richard, Zandi Peter P
Department of Psychiatry and Behavioral Sciences, Johns Hopkins University, Baltimore, MD 21205, USA.
Hum Genomics. 2014 Jul 30;8(1):14. doi: 10.1186/1479-7364-8-14.
The processing and analysis of the large-scale data generated by next-generation sequencing (NGS) experiments is challenging and is a burgeoning area of new methods development. Several new bioinformatics tools have been developed for calling sequence variants from NGS data. Here, we validate the variant calling of these tools and compare their relative accuracy to determine which data processing pipeline is optimal.
We developed a unified pipeline for processing NGS data that encompasses four modules: mapping, filtering, realignment and recalibration, and variant calling. We processed 130 subjects from an ongoing whole exome sequencing study through this pipeline. To evaluate the accuracy of each module, we conducted a series of comparisons between the single nucleotide variant (SNV) calls from the NGS data and either gold-standard Sanger sequencing on a total of 700 variants or array genotyping data on a total of 9,935 single-nucleotide polymorphisms. A head-to-head comparison showed that the Genome Analysis Toolkit (GATK) provided more accurate calls than SAMtools (positive predictive value of 92.55% vs. 80.35%, respectively). Realignment of mapped reads and recalibration of base quality scores before SNV calling proved to be crucial to accurate variant calling. The GATK HaplotypeCaller algorithm for variant calling outperformed the UnifiedGenotyper algorithm. We also showed a relationship between SNV call accuracy and mapping quality, read depth, and allele balance. However, if best practices are used in data processing, then additional filtering based on these metrics provides little gain, and accuracies of >99% are achievable.
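The positive predictive value reported above is the fraction of NGS-derived SNV calls confirmed by the gold standard. A minimal sketch of that calculation, using hypothetical variant calls and Sanger-confirmed genotypes (the function name and toy data are illustrative, not from the study):

```python
def positive_predictive_value(calls, gold_standard):
    """PPV = true positives / (true positives + false positives).

    `calls` and `gold_standard` map genomic positions to alternate
    alleles; a call counts as a true positive when the gold-standard
    allele at that position agrees, and as a false positive otherwise.
    """
    tp = sum(1 for pos, allele in calls.items()
             if gold_standard.get(pos) == allele)
    fp = len(calls) - tp
    return tp / (tp + fp)

# Hypothetical example: 4 of 5 NGS calls confirmed by Sanger sequencing.
ngs_calls = {100: "A", 200: "T", 300: "G", 400: "C", 500: "A"}
sanger = {100: "A", 200: "T", 300: "G", 400: "C", 500: "T"}
print(positive_predictive_value(ngs_calls, sanger))  # 0.8
```

The same comparison against the 9,935 array-genotyped SNPs follows the identical logic, with the genotyping array supplying the gold-standard alleles.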
Our findings will help to determine the best approach for processing NGS data to confidently call variants for downstream analyses. To enable others to implement and replicate our results, all of our code is freely available at http://metamoodics.org/wes.