Zaykin Dmitri V, Pudovkin Alexander, Weir Bruce S
National Institute of Environmental Health Sciences, National Institutes of Health, Research Triangle Park, North Carolina 27709, USA.
Genetics. 2008 Sep;180(1):533-45. doi: 10.1534/genetics.108.089409. Epub 2008 Aug 30.
The correlation between alleles at a pair of genetic loci is a measure of linkage disequilibrium. The square of the sample correlation multiplied by sample size provides the usual test statistic for the hypothesis of no disequilibrium for loci with two alleles, and this relation has proved useful for study design and marker selection. Nevertheless, this relation holds only in the diallelic case, and an extension to multiple alleles has not been made. Here we introduce a similar statistic, R^2, which leads to a correlation-based test for loci with multiple alleles: for a pair of loci with k and m alleles, and a sample of n individuals, the approximate distribution of n[(k - 1)(m - 1)/(km)]R^2 under independence between loci is chi-square with (k - 1)(m - 1) degrees of freedom. One advantage of this statistic is that it can be interpreted as the total correlation between a pair of loci. When the phase of two-locus genotypes is known, the approach is equivalent to a test for the overall correlation between rows and columns in a contingency table. In the phase-known case, R^2 is the sum of the squared sample correlations for all km 2 x 2 subtables formed by collapsing to one allele vs. the rest at each locus. We examine the approximate distribution of R^2 under the null hypothesis of independence and report its close agreement with the exact distribution obtained by permutation. The test for independence using R^2 is a strong competitor to approaches such as Pearson's chi-square test, Fisher's exact test, and a test based on Cressie and Read's power divergence statistic. We combine this approach with our previous composite-disequilibrium measures to address the case when the genotypic phase is unknown. Calculation of the new multiallele test statistic and its P-value is very simple and uses the approximate distribution of R^2. We provide a computer program that evaluates approximate as well as "exact" permutational P-values.
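The phase-known calculation described in the abstract can be illustrated with a short sketch. This is not the authors' program. It assumes a k x m table of integer haplotype counts in which every allele is observed at least once, and it takes n to be the table total. The function names (`r2_total`, `scaled_statistic`, `chi2_sf`, `permutation_p`) are hypothetical, and the permutation routine is a Monte Carlo approximation with a (hits + 1)/(n_perm + 1) convention chosen here for illustration.

```python
import math
import random


def r2_total(table):
    """Sum of squared correlations over all k*m one-vs-rest 2 x 2 subtables."""
    n = sum(sum(row) for row in table)
    k, m = len(table), len(table[0])
    p = [sum(row) / n for row in table]                             # allele frequencies, locus 1
    q = [sum(table[i][j] for i in range(k)) / n for j in range(m)]  # allele frequencies, locus 2
    total = 0.0
    for i in range(k):
        for j in range(m):
            d = table[i][j] / n - p[i] * q[j]  # disequilibrium coefficient for alleles i, j
            total += d * d / (p[i] * (1 - p[i]) * q[j] * (1 - q[j]))
    return total


def scaled_statistic(table):
    """n (k - 1)(m - 1) / (k m) * R^2, with n assumed to be the table total."""
    n = sum(sum(row) for row in table)
    k, m = len(table), len(table[0])
    return n * (k - 1) * (m - 1) / (k * m) * r2_total(table)


def chi2_sf(x, df):
    """Upper tail of a chi-square distribution with integer df >= 1."""
    if x <= 0:
        return 1.0
    h = x / 2.0
    # Start from the closed form for df = 2 (even) or df = 1 (odd) ...
    if df % 2 == 0:
        a, tail = 1.0, math.exp(-h)
    else:
        a, tail = 0.5, math.erfc(math.sqrt(h))
    # ... then use Q(a + 1, h) = Q(a, h) + h^a e^(-h) / Gamma(a + 1).
    while a < df / 2.0 - 1e-9:
        tail += h ** a * math.exp(-h) / math.gamma(a + 1)
        a += 1.0
    return tail


def permutation_p(table, n_perm=2000, seed=0):
    """Monte Carlo permutation P-value: shuffle locus-2 alleles across haplotypes."""
    rng = random.Random(seed)
    k, m = len(table), len(table[0])
    cells = [(i, j) for i in range(k) for j in range(m) for _ in range(table[i][j])]
    a = [i for i, _ in cells]
    b = [j for _, j in cells]
    observed = r2_total(table)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(b)  # margins are preserved, so r2_total stays well defined
        perm = [[0] * m for _ in range(k)]
        for ai, bj in zip(a, b):
            perm[ai][bj] += 1
        if r2_total(perm) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Diallelic check: R^2 is 4 r^2 here, so the scaled statistic reduces to n r^2 = 20.
counts = [[30, 10], [10, 30]]
stat = scaled_statistic(counts)
df = (len(counts) - 1) * (len(counts[0]) - 1)
print(stat, chi2_sf(stat, df), permutation_p(counts))
```

For two alleles at each locus the four collapsed subtables are the same table up to relabeling, so the scaled statistic equals n times the squared correlation, which is the usual diallelic test mentioned at the start of the abstract. The phase-unknown case based on composite-disequilibrium measures is not covered by this sketch.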