Ahn Soyeon, Vikalo Haris
Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, 78712, Texas, USA.
BMC Bioinformatics. 2015 Jul 16;16:223. doi: 10.1186/s12859-015-0651-8.
Genetic variations predispose individuals to hereditary diseases, play an important role in the development of complex diseases, and impact drug metabolism. The full information about the DNA variations in the genome of an individual is given by haplotypes, the ordered lists of single nucleotide polymorphisms (SNPs) located on chromosomes. Affordable high-throughput DNA sequencing technologies enable routine acquisition of the data needed for the assembly of single individual haplotypes. However, state-of-the-art high-throughput sequencing platforms generate erroneous data, which induces uncertainty in the SNP and genotype calling procedures and, ultimately, adversely affects the accuracy of haplotyping. When inferring haplotype phase information, the vast majority of existing haplotype assembly techniques assume that the genotype information is correct. This motivates the development of methods capable of joint genotype calling and haplotype assembly.
We present a haplotype assembly algorithm, ParticleHap, that relies on a probabilistic description of the sequencing data to jointly infer genotypes and assemble the most likely haplotypes. Our method employs a deterministic sequential Monte Carlo algorithm that associates single nucleotide polymorphisms with haplotypes by exhaustively exploring all possible extensions of the partial haplotypes. The algorithm relies on genotype likelihoods rather than on called genotypes, which are often erroneous, thus ensuring a more accurate assembly of the haplotypes. Results on both experimental data from the 1000 Genomes Project and simulation studies demonstrate that the proposed approach yields highly accurate solutions to the haplotype assembly problem while remaining computationally efficient and scalable, and that it generally outperforms existing methods in terms of both accuracy and speed.
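The idea of extending partial haplotypes site by site can be illustrated with a small sketch. This is not the ParticleHap implementation: the read representation (a dict from SNP index to observed allele), the single per-base error rate `eps`, the equal-mixture read likelihood, and the names `read_loglik`, `assemble`, and `max_particles` are all assumptions made for illustration. At each SNP site, every retained partial haplotype pair is extended by all four allele pairs (so homozygous sites are allowed and the genotype is inferred rather than fixed in advance), each extension is scored by the log-likelihood of the reads, and only the best-scoring candidates are kept, with no random sampling.

```python
from itertools import product
from math import log


def read_loglik(read, h1, h2, upto, eps):
    """Log-likelihood of a read's alleles at sites < upto, assuming the read
    came from h1 or h2 with equal probability and each base is wrong with
    probability eps (illustrative error model, not the paper's)."""
    p1 = p2 = 1.0
    for site, allele in read.items():
        if site < upto:
            p1 *= (1 - eps) if h1[site] == allele else eps
            p2 *= (1 - eps) if h2[site] == allele else eps
    return log(0.5 * p1 + 0.5 * p2)


def assemble(reads, n_sites, eps=0.05, max_particles=8):
    """Deterministic particle-style search over haplotype pairs.

    reads: list of dicts mapping SNP index -> observed allele (0 or 1).
    Returns the best-scoring pair (h1, h2) as tuples of alleles.
    """
    particles = [((), ())]  # start from one empty partial haplotype pair
    for j in range(n_sites):
        scored = []
        for h1, h2 in particles:
            # Exhaustively try all four extensions: 00, 01, 10, 11.
            for a1, a2 in product((0, 1), repeat=2):
                c1, c2 = h1 + (a1,), h2 + (a2,)
                score = sum(read_loglik(r, c1, c2, j + 1, eps) for r in reads)
                scored.append((score, c1, c2))
        # Keep the highest-scoring candidates deterministically.
        scored.sort(key=lambda t: t[0], reverse=True)
        particles = [(c1, c2) for _, c1, c2 in scored[:max_particles]]
    return particles[0]


if __name__ == "__main__":
    # Hypothetical error-free reads drawn from haplotypes 0101 and 1010.
    reads = [
        {0: 0, 1: 1}, {1: 1, 2: 0}, {2: 0, 3: 1},
        {0: 1, 1: 0}, {1: 0, 2: 1}, {2: 1, 3: 0},
    ]
    print(assemble(reads, n_sites=4))
```

For clarity the sketch recomputes each read likelihood from scratch at every step; a practical implementation would update the weights incrementally. The order of the two returned haplotypes is arbitrary, since swapping them gives the same likelihood.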
The developed probabilistic framework and sequential Monte Carlo algorithm enable joint haplotype assembly and genotyping in a computationally efficient manner. Our results demonstrate fast and highly accurate haplotype assembly aided by the re-examination of erroneously called genotypes. A C code implementation of ParticleHap will be available for download from https://sites.google.com/site/asynoeun/particlehap.