Bafna Vineet, Gusfield Dan, Lancia Giuseppe, Yooseph Shibu
The Center for Advancement of Genomics, 1901 Research Blvd., 6th floor, Rockville, MD 20850, USA.
J Comput Biol. 2003;10(3-4):323-40. doi: 10.1089/10665270360688048.
A full haplotype map of the human genome will prove extremely valuable as it will be used in large-scale screens of populations to associate specific haplotypes with specific complex genetic-influenced diseases. A haplotype map project has been announced by NIH. The biological key to that project is the surprising fact that some human genomic DNA can be partitioned into long blocks where genetic recombination has been rare, leading to strikingly fewer distinct haplotypes in the population than previously expected (Helmuth, 2001; Daly et al., 2001; Stephens et al., 2001; Friss et al., 2001). In this paper we explore the algorithmic implications of the no-recombination in long blocks observation, for the problem of inferring haplotypes in populations. This assumption, together with the standard population-genetic assumption of infinite sites, motivates a model of haplotype evolution where the haplotypes in a population are assumed to evolve along a coalescent, which as a rooted tree is a perfect phylogeny. We consider the following algorithmic problem, called the perfect phylogeny haplotyping problem (PPH), which was introduced by Gusfield (2002) - given n genotypes of length m each, does there exist a set of at most 2n haplotypes such that each genotype is generated by a pair of haplotypes from this set, and such that this set can be derived on a perfect phylogeny? The approach taken by Gusfield (2002) to solve this problem reduces it to established, deep results and algorithms from matroid and graph theory. Although that reduction is quite simple and the resulting algorithm nearly optimal in speed, taken as a whole that approach is quite involved, and in particular, challenging to program. Moreover, anyone wishing to fully establish, by reading existing literature, the correctness of the entire algorithm would need to read several deep and difficult papers in graph and matroid theory. However, as stated by Gusfield (2002), many simplifications are possible and the list of "future work" in Gusfield (2002) began with the task of developing a simpler, more direct, yet still efficient algorithm. This paper accomplishes that goal, for both the rooted and unrooted PPH problems. It establishes a simple, easy-to-program, O(nm(2))-time algorithm that determines whether there is a PPH solution for input genotypes and produces a linear-space data structure to represent all of the solutions. The approach allows complete, self-contained proofs. In addition to algorithmic simplicity, the approach here makes the representation of all solutions more intuitive than in Gusfield (2002), and solves another goal from that paper, namely, to prove a nontrivial upper bound on the number of PPH solutions, showing that that number is vastly smaller than the number of haplotype solutions (each solution being a set of n pairs of haplotypes that can generate the genotypes) when the perfect phylogeny requirement is not imposed.
人类基因组的完整单倍型图谱将被证明具有极高的价值,因为它将用于对人群进行大规模筛查,以将特定单倍型与特定的复杂基因影响疾病联系起来。美国国立卫生研究院(NIH)已宣布开展一个单倍型图谱项目。该项目的生物学关键在于一个令人惊讶的事实,即一些人类基因组DNA可被划分为长片段,在这些片段中基因重组很少发生,导致人群中不同单倍型的数量比先前预期的要少得多(赫尔穆特,2001年;戴利等人,2001年;斯蒂芬斯等人,2001年;弗里斯等人,2001年)。在本文中,我们探讨了长片段中无重组这一观察结果在算法方面的影响,针对人群中单倍型推断问题。这一假设,连同无限位点的标准群体遗传学假设,促成了一个单倍型进化模型,其中假设人群中的单倍型沿着一个溯祖过程进化,作为一棵有根树,它是一个完美系统发育树。我们考虑以下算法问题,称为完美系统发育单倍型问题(PPH),这是由古兹菲尔德(2002年)提出的——给定n个长度均为m的基因型,是否存在一组至多2n个单倍型,使得每个基因型由该集合中的一对单倍型产生,并且使得这个集合可以从一个完美系统发育树推导出来?古兹菲尔德(2002年)解决这个问题所采用的方法将其简化为来自拟阵和图论的既定的、深入的结果和算法。尽管这种简化相当简单,并且所得算法在速度上几乎是最优的,但总体而言,那种方法相当复杂,特别是编程具有挑战性。此外,任何希望通过阅读现有文献来完全确定整个算法正确性的人都需要阅读几篇关于图论和拟阵理论的深奥且困难的论文。然而,正如古兹菲尔德(2002年)所述,许多简化是可能的,并且古兹菲尔德(2002年)中“未来工作”列表从开发一个更简单、更直接但仍然高效的算法的任务开始。本文实现了这一目标,针对有根和无根的PPH问题。它建立了一个简单、易于编程的O(nm(2))时间算法,该算法确定输入基因型是否存在PPH解决方案,并生成一个线性空间数据结构来表示所有解决方案。该方法允许进行完整的、自成体系的证明。除了算法简单性之外,这里的方法使所有解决方案的表示比古兹菲尔德(2002年)中的更直观,并且解决了该论文中的另一个目标,即证明PPH解决方案数量的一个非平凡上界,表明当不施加完美系统发育要求时,该数量比单倍型解决方案的数量(每个解决方案是一组n对可以产生基因型的单倍型)要小得多。