Zhi Degui, Liu Nianjun, Zhang Kui
Section on Statistical Genetics, Department of Biostatistics, University of Alabama at Birmingham, Birmingham, AL 35294, United States.
Methods. 2015 Jun;79-80:41-6. doi: 10.1016/j.ymeth.2015.01.016. Epub 2015 Jan 30.
Next-generation sequencing (NGS) technologies, which can provide base-pair-resolution genetic information for all types of genetic variation, are increasingly used in genetics research. However, because NGS technologies and analytics are complex and relatively costly, investigators face practical challenges in both study design and analysis. These challenges are further complicated by recent methodological developments that make it possible to use haplotype information in sequencing reads. In light of these developments, we conducted comprehensive simulations to evaluate the effects of sequencing coverage, insert size of paired-end reads, and sample size on genotype calling and haplotype phasing in NGS studies. In contrast to previous studies, which typically used idealized scenarios to isolate the effects of individual design and analytic decisions, we used a complete analytical pipeline, from read mapping and variant detection to genotype calling and haplotype phasing, so that we could assess the joint effects of multiple decisions and thus make more realistic recommendations to investigators. Consistent with previous studies, we found that using haplotype information in reads can improve the accuracy of genotype calling and haplotype phasing. We also found that a mixture of short and long insert sizes of paired-end reads may offer even greater accuracy. However, this benefit is clear only in high-coverage sequencing, where variant detection is close to perfect. Finally, we observed that LD-based refinement methods do not always outperform single-site-based methods for genotype calling. Therefore, to make use of haplotype information in sequencing reads, investigators should choose analytical methods that are appropriate to the sequencing coverage and sample size.