Marchini Jonathan, Cutler David, Patterson Nick, Stephens Matthew, Eskin Eleazar, Halperin Eran, Lin Shin, Qin Zhaohui S, Munro Heather M, Abecasis Goncalo R, Donnelly Peter
Department of Statistics, University of Oxford, Oxford OX1 3TG, United Kingdom.
Am J Hum Genet. 2006 Mar;78(3):437-50. doi: 10.1086/500808. Epub 2006 Jan 26.
Knowledge of haplotype phase is valuable for many analysis methods in the study of disease, population, and evolutionary genetics. Considerable research effort has been devoted to the development of statistical and computational methods that infer haplotype phase from genotype data. Although a substantial number of such methods have been developed, they have focused principally on inference from unrelated individuals, and comparisons between methods have been rather limited. Here, we describe the extension of five leading phase-inference algorithms to handle father-mother-child trios. We performed a comprehensive assessment of the methods applied both to trios and to unrelated individuals, with a focus on genomic-scale problems, using both simulated data and data from the HapMap project. The most accurate algorithm was PHASE (v2.1). For this method, the percentages of genotypes whose phase was incorrectly inferred were 0.12%, 0.05%, and 0.16% for trios from simulated data, HapMap Centre d'Etude du Polymorphisme Humain (CEPH) trios, and HapMap Yoruban trios, respectively, and 5.2% and 5.9% for unrelated individuals in simulated data and the HapMap CEPH data, respectively. The other methods considered in this work had comparable but slightly worse error rates. The error rates for trios are similar to the levels of genotyping error and missing data expected. We thus conclude that all the methods considered will provide highly accurate estimates of haplotypes when applied to trio data sets. Running times differ substantially between methods. Although it is one of the slowest methods, PHASE (v2.1) was used to infer haplotypes for the 1 million-SNP HapMap data set. Finally, we evaluated methods of estimating the value of r² between a pair of SNPs and concluded that all methods estimated r² well when the estimated value was ≥0.8.
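For readers unfamiliar with the r² statistic mentioned at the end of the abstract, the following is a minimal sketch of its standard definition for two biallelic SNPs, computed from phased haplotypes. It is an illustration only and is not taken from the paper: the function name `r_squared` and the 0/1 allele coding are assumptions made here, and the methods evaluated in the study estimate r² from inferred rather than known haplotypes.

```python
from typing import Sequence, Tuple


def r_squared(haplotypes: Sequence[Tuple[int, int]]) -> float:
    """Return r^2 between two biallelic SNPs from phased haplotypes.

    Each haplotype is a pair (a, b) of alleles coded 0 or 1 at SNP 1 and
    SNP 2 (this coding is an assumption for illustration).
    """
    n = len(haplotypes)
    if n == 0:
        raise ValueError("need at least one haplotype")

    # Frequencies of allele 1 at each SNP and of the 1-1 haplotype.
    p_a = sum(a for a, _ in haplotypes) / n
    p_b = sum(b for _, b in haplotypes) / n
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n

    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined when a SNP is monomorphic")

    # D is the linkage disequilibrium coefficient; r^2 = D^2 / denom.
    d = p_ab - p_a * p_b
    return d * d / denom
```

With this definition, two SNPs whose alleles always travel together give a value of 1, and SNPs whose alleles are statistically independent in the sample give 0. Because the haplotype frequency `p_ab` depends on phase, errors in inferred phase propagate into the estimate, which is why the abstract assesses r² estimation alongside phasing accuracy.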