Bansal Vikas, Halpern Aaron L, Axelrod Nelson, Bafna Vineet
Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA 92093, USA.
Genome Res. 2008 Aug;18(8):1336-46. doi: 10.1101/gr.077065.108.
In comparison to genotypes, knowledge about haplotypes (the combination of alleles present on a single chromosome) is much more useful for whole-genome association studies and for making inferences about human evolutionary history. Haplotypes are typically inferred from population genotype data using computational methods. Whole-genome sequence data represent a promising resource for constructing haplotypes spanning hundreds of kilobases for an individual. In this article, we propose a Markov chain Monte Carlo (MCMC) algorithm, HASH (haplotype assembly for single human), for assembling haplotypes from sequenced DNA fragments that have been mapped to a reference genome assembly. The transitions of the Markov chain are generated using min-cut computations on graphs derived from the sequenced fragments. We have applied our method to infer haplotypes using whole-genome shotgun sequence data from a recently sequenced human individual. The high sequence coverage and the presence of mate pairs result in fairly long haplotypes (N50 length ~350 kb). Based on comparison of the sequenced fragments against the individual haplotypes, we demonstrate that the haplotypes for this individual inferred using HASH are significantly more accurate than the haplotypes estimated using a previously proposed greedy heuristic and a simple MCMC method. Using haplotypes from the HapMap project, we estimate the switch error rate of the haplotypes inferred using HASH to be quite low, ~1.1%. Our Markov chain Monte Carlo algorithm represents a general framework for haplotype assembly that can be applied to sequence data generated by other sequencing technologies. The code implementing the methods and the phased individual haplotypes can be downloaded from http://www.cse.ucsd.edu/users/vibansal/HASH/.
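The two summary statistics quoted in the abstract, N50 haplotype length and switch error rate, can be illustrated with a minimal Python sketch. It follows the commonly used definitions of both quantities and is not taken from the HASH code; the function names `n50` and `switch_error_rate` and the encoding of a haplotype as a string of 0/1 alleles over heterozygous sites are assumptions made for illustration. The exact procedure used for the HapMap comparison may differ.

```python
from typing import Sequence


def n50(lengths: Sequence[int]) -> int:
    """Return the N50 of a set of block lengths (hypothetical helper).

    N50 is the largest length L such that blocks of length >= L
    together cover at least half of the total length.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("lengths must be non-empty")


def switch_error_rate(inferred: str, truth: str) -> float:
    """Return the switch error rate of one inferred haplotype (hypothetical helper).

    Both arguments are strings of '0'/'1' alleles over the same ordered
    heterozygous sites. A switch error is counted whenever the phase
    between two adjacent sites differs from the phase in the truth.
    """
    if len(inferred) != len(truth):
        raise ValueError("haplotypes must cover the same sites")
    if len(inferred) < 2:
        raise ValueError("need at least two sites to define a switch")
    # agree[i] is True when the inferred allele matches the truth at site i.
    agree = [a == b for a, b in zip(inferred, truth)]
    # A change in agreement between adjacent sites marks a phase switch.
    switches = sum(1 for i in range(1, len(agree)) if agree[i] != agree[i - 1])
    return switches / (len(agree) - 1)


if __name__ == "__main__":
    print(n50([100, 200, 300, 400]))
    print(switch_error_rate("000111", "000000"))
```

Because a haplotype and its complement describe the same phasing, an inferred haplotype that is the exact complement of the truth has no switch errors under this definition.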