Delaneau Olivier, Coulonges Cédric, Zagury Jean-François
Chaire de Bioinformatique, Conservatoire National des Arts et Métiers, 292 rue Saint-Martin, 75003 Paris, France.
BMC Bioinformatics. 2008 Dec 16;9:540. doi: 10.1186/1471-2105-9-540.
We have developed a new computational algorithm, Shape-IT, to infer haplotypes under the genetic model of coalescence with recombination developed by Stephens et al in Phase v2.1. It runs much faster than Phase v2.1 while exhibiting the same accuracy. The major algorithmic improvements rely on the use of binary trees to represent the sets of candidate haplotypes for each individual. These binary tree representations: (1) speed up the computations of posterior probabilities of the haplotypes by avoiding the redundant operations made in Phase v2.1, and (2) overcome the exponential aspect of the haplotypes inference problem by the smart exploration of the most plausible pathways (ie. haplotypes) in the binary trees.
Our results show that Shape-IT is several orders of magnitude faster than Phase v2.1 while being as accurate. For instance, Shape-IT runs 50 times faster than Phase v2.1 to compute the haplotypes of 200 subjects on 6,000 segments of 50 SNPs extracted from a standard Illumina 300 K chip (13 days instead of 630 days). We also compared Shape-IT with other widely used software, Gerbil, PL-EM, Fastphase, 2SNP, and Ishape in various tests: Shape-IT and Phase v2.1 were the most accurate in all cases, followed by Ishape and Fastphase. As a matter of speed, Shape-IT was faster than Ishape and Fastphase for datasets smaller than 100 SNPs, but Fastphase became faster -but still less accurate- to infer haplotypes on larger SNP datasets.
Shape-IT deserves to be extensively used for regular haplotype inference but also in the context of the new high-throughput genotyping chips since it permits to fit the genetic model of Phase v2.1 on large datasets. This new algorithm based on tree representations could be used in other HMM-based haplotype inference software and may apply more largely to other fields using HMM.
我们开发了一种新的计算算法Shape-IT,用于在斯蒂芬斯等人在v2.1版本中提出的带有重组的合并遗传模型下推断单倍型。它的运行速度比v2.1版本快得多,同时具有相同的准确性。主要的算法改进依赖于使用二叉树来表示每个个体的候选单倍型集合。这些二叉树表示:(1)通过避免v2.1版本中进行的冗余操作,加快了单倍型后验概率的计算;(2)通过在二叉树中巧妙地探索最合理的路径(即单倍型),克服了单倍型推断问题的指数特性。
我们的结果表明,Shape-IT比v2.1版本快几个数量级,同时准确性相同。例如,在从标准Illumina 300K芯片提取的50个单核苷酸多态性(SNP)的6000个片段上计算200个受试者的单倍型时,Shape-IT的运行速度比v2.1版本快50倍(13天而不是630天)。我们还在各种测试中将Shape-IT与其他广泛使用的软件Gerbil、PL-EM、Fastphase、2SNP和Ishape进行了比较:Shape-IT和v2.1版本在所有情况下都是最准确的,其次是Ishape和Fastphase。在速度方面,对于小于100个SNP的数据集,Shape-IT比Ishape和Fastphase快,但在更大的SNP数据集上推断单倍型时,Fastphase变得更快——但准确性仍然较低。
Shape-IT不仅在常规单倍型推断中值得广泛使用,而且在新的高通量基因分型芯片的背景下也值得使用,因为它能够在大型数据集上拟合v2.1版本的遗传模型。这种基于树表示的新算法可用于其他基于隐马尔可夫模型(HMM)的单倍型推断软件,并可能更广泛地应用于使用HMM的其他领域。