Bioinformatics Group, Wageningen University and Research, The Netherlands.
Wageningen UR Plant Breeding, The Netherlands.
Brief Bioinform. 2018 May 1;19(3):387-403. doi: 10.1093/bib/bbw126.
Haplotypes are the units of inheritance in an organism, and many genetic analyses depend on their precise determination. Methods for haplotyping single individuals use the phasing information available in next-generation sequencing reads, by matching overlapping single-nucleotide polymorphisms while penalizing any post hoc nucleotide corrections. Haplotyping diploids is relatively easy, but the complexity of the problem increases drastically for polyploid genomes, which are found both in model organisms and in economically relevant plant and animal species. Although a number of tools are available for haplotyping polyploids, the effects of the genomic makeup and of the sequencing strategy on the accuracy of these methods have hitherto not been thoroughly evaluated. We developed the simulation pipeline haplosim to evaluate the performance of three haplotype estimation algorithms for polyploids: HapCompass, HapTree and SDhaP, in settings varying in sequencing approach, ploidy levels and genomic diversity, using tetraploid potato as the model. Our results show that sequencing depth is the major determinant of haplotype estimation quality, that 1 kb PacBio circular consensus sequencing reads and Illumina reads with large insert sizes are competitive and that all methods fail to produce good haplotypes when ploidy levels increase. Comparing the three methods, HapTree produces the most accurate estimates, but also consumes the most resources. There is clearly room for improvement in polyploid haplotyping algorithms.
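To make the idea of "matching overlapping single-nucleotide polymorphisms while penalizing corrections" concrete, the following minimal Python sketch scores a candidate set of haplotypes by counting how many allele calls in the reads would have to be corrected for every read to agree with its best-matching haplotype. It is an illustration only: the function names, the 0/1 allele encoding and the toy data are assumptions made here, and the sketch does not reproduce the objective functions of HapCompass, HapTree or SDhaP.

```python
from typing import Dict, List, Sequence

# Hypothetical encoding: a read maps SNP index -> observed allele
# (0 = reference, 1 = alternative); a haplotype is a tuple of alleles.
Read = Dict[int, int]


def corrections_needed(read: Read, haplotype: Sequence[int]) -> int:
    """Count the allele calls in a read that disagree with one haplotype."""
    return sum(1 for pos, allele in read.items() if haplotype[pos] != allele)


def correction_score(reads: List[Read], haplotypes: List[Sequence[int]]) -> int:
    """Assign each read to its best-matching haplotype and sum the corrections."""
    return sum(
        min(corrections_needed(read, hap) for hap in haplotypes)
        for read in reads
    )


# Toy tetraploid example: four haplotypes over five SNP positions.
haplotypes = [
    (0, 0, 1, 1, 0),
    (0, 1, 1, 0, 0),
    (1, 0, 0, 1, 1),
    (1, 1, 0, 0, 1),
]
reads = [
    {0: 0, 1: 0, 2: 1},  # agrees with the first haplotype
    {2: 0, 3: 1, 4: 1},  # agrees with the third haplotype
    {1: 1, 2: 1, 3: 1},  # needs one correction against its best match
]

print(correction_score(reads, haplotypes))  # 1
```

A lower score means the candidate haplotypes explain the reads with fewer corrections. With four haplotypes instead of two, a short read is more likely to fit several haplotypes equally well, which is one intuition for why the problem becomes harder as ploidy increases.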