Department of Genetics, Harvard Medical School, Boston, MA 02115, USA; Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.
Am J Hum Genet. 2012 Aug 10;91(2):238-51. doi: 10.1016/j.ajhg.2012.06.013.
Haplotypes are an important resource for a large number of applications in human genetics, but computationally inferred haplotypes are subject to switch errors that decrease their utility. The accuracy of computationally inferred haplotypes increases with sample size, and although ever larger genotypic data sets are being generated, the fact that existing methods require substantial computational resources limits their applicability to data sets containing tens or hundreds of thousands of samples. Here, we present HAPI-UR (haplotype inference for unrelated samples), an algorithm that is designed to handle unrelated and/or trio and duo family data, that has accuracy comparable to or greater than existing methods, and that is computationally efficient and can be applied to 100,000 samples or more. We use HAPI-UR to phase a data set with 58,207 samples and show that it achieves practical runtime and that switch errors decrease with sample size even with the use of samples from multiple ethnicities. Using a data set with 16,353 samples, we compare HAPI-UR to Beagle, MaCH, IMPUTE2, and SHAPEIT and show that HAPI-UR runs 18× faster than all methods and has a lower switch-error rate than do other methods except for Beagle; with the use of consensus phasing, running HAPI-UR three times gives a slightly lower switch-error rate than Beagle does and is more than six times faster. We demonstrate results similar to those from Beagle on another data set with a higher marker density. Lastly, we show that HAPI-UR has better runtime scaling properties than does Beagle so that for larger data sets, HAPI-UR will be practical and will have an even larger runtime advantage. HAPI-UR is available online (see Web Resources).
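The abstract evaluates methods by switch-error rate and mentions consensus phasing over three runs. The sketch below illustrates both ideas for a single individual. It is a minimal illustration, not the HAPI-UR implementation. It rests on assumptions that are not in the source: each phased result is a list of 0/1 alleles carried by one haplotype at the individual's heterozygous sites (the other haplotype is the complement), and the consensus is a simple majority vote on the relative phase of each pair of consecutive heterozygous sites. The function names `switch_error_rate` and `consensus_phase` are hypothetical.

```python
from typing import List, Sequence


def switch_error_rate(inferred: Sequence[int], truth: Sequence[int]) -> float:
    """Fraction of consecutive heterozygous-site pairs whose relative phase
    in `inferred` differs from `truth`.

    Both inputs hold the 0/1 alleles on one haplotype at heterozygous sites
    only (an assumed representation); which haplotype is listed is irrelevant.
    """
    if len(inferred) != len(truth):
        raise ValueError("inferred and truth must cover the same sites")
    n_pairs = len(truth) - 1
    if n_pairs < 1:
        return 0.0  # fewer than two heterozygous sites: no switch is possible
    switches = sum(
        (inferred[i] ^ inferred[i + 1]) != (truth[i] ^ truth[i + 1])
        for i in range(n_pairs)
    )
    return switches / n_pairs


def consensus_phase(runs: Sequence[Sequence[int]]) -> List[int]:
    """Majority vote across runs on the relative phase of each consecutive
    pair of heterozygous sites (an assumed scheme, not taken from the paper).
    """
    n_sites = len(runs[0])
    if any(len(run) != n_sites for run in runs):
        raise ValueError("all runs must cover the same sites")
    if n_sites == 0:
        return []
    consensus = [0]  # arbitrary orientation for the first site
    for i in range(n_sites - 1):
        # Count runs in which the allele changes between site i and site i+1.
        votes = sum(run[i] ^ run[i + 1] for run in runs)
        flip = 1 if 2 * votes > len(runs) else 0
        consensus.append(consensus[-1] ^ flip)
    return consensus


if __name__ == "__main__":
    truth = [0, 1, 1, 0, 1]
    runs = [
        [0, 1, 1, 0, 0],  # one switch error, at the last pair
        [1, 0, 1, 0, 1],  # one switch error, at the second pair
        [0, 1, 1, 0, 1],  # no switch errors
    ]
    for run in runs:
        print(switch_error_rate(run, truth))
    print(switch_error_rate(consensus_phase(runs), truth))
```

In this toy input the individual runs make their errors at different positions, so the majority vote recovers the true phase. Real runs can share errors, in which case a vote would not remove them.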