Department of Statistics and Applied Probability, National University of Singapore, 117546 Singapore.
Bioinformatics. 2010 Oct 15;26(20):2556-63. doi: 10.1093/bioinformatics/btq492. Epub 2010 Aug 27.
It has been claimed in the literature that pooling DNA samples is efficient for estimating haplotype frequencies. There is, however, no theoretical justification based on calculation of statistical efficiency; the limited evidence given so far comes from simulation studies with small numbers of loci. With rapid advances in technology, it is of interest to see whether pooling remains efficient as the number of loci increases.
Instead of resorting to simulation studies, we use asymptotic statistical theory to calculate exactly the efficiency of pooling relative to no pooling in the estimation of haplotype frequencies. As an intermediate step, we use the log-linear formulation of the haplotype probabilities and derive the asymptotic variance-covariance matrix of the maximum likelihood estimators of the canonical parameters of the log-linear model.
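As a minimal sketch of the log-linear formulation (an illustration, not the authors' R code), consider biallelic loci with a haplotype coded as a 0/1 vector. Under linkage equilibrium the haplotype probability factorizes over loci, so the log-probability has only an intercept and one main effect per locus, and the canonical parameter for a locus is the log-odds of its allele coded 1. All function and variable names below are hypothetical, and the allele frequencies are made-up inputs.

```python
import itertools
import math

def haplotype_probs_le(maf):
    """Haplotype probabilities under linkage equilibrium (assumed model).

    maf[j] is the frequency of the allele coded 1 at locus j.
    Returns a dict mapping each 0/1 haplotype tuple to its probability.
    """
    probs = {}
    for h in itertools.product((0, 1), repeat=len(maf)):
        probs[h] = math.prod(p if a else 1.0 - p for a, p in zip(h, maf))
    return probs

def loglinear_params_le(maf):
    """Intercept and main effects of the log-linear model under equilibrium."""
    intercept = sum(math.log(1.0 - p) for p in maf)
    main_effects = [math.log(p / (1.0 - p)) for p in maf]  # per-locus log-odds
    return intercept, main_effects

def log_prob(h, intercept, main_effects):
    """log p(h) = intercept + sum_j theta_j * h_j (no interaction terms)."""
    return intercept + sum(t * a for t, a in zip(main_effects, h))

# Illustrative allele frequencies, not taken from the paper.
maf = [0.05, 0.2, 0.4]
probs = haplotype_probs_le(maf)
intercept, main_effects = loglinear_params_le(maf)

# Largest discrepancy between the two representations over all haplotypes.
max_abs_diff = max(
    abs(math.log(probs[h]) - log_prob(h, intercept, main_effects))
    for h in probs
)
```

Linkage disequilibrium would add interaction terms between loci to this expansion; the sketch omits them on purpose.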
Based on our calculations under linkage equilibrium, pooling can suffer a huge loss in efficiency relative to no pooling when there are more than three independent loci and the alleles are not rare. Pooling works better for rare alleles. In particular, if all the minor allele frequencies are 0.05, pooling maintains an advantage over no pooling until the number of independent loci reaches 6. High linkage disequilibrium effectively reduces the number of independent loci by ruling out certain haplotypes from occurring. Similar calculations of efficiency for the case of no pooling justify the common belief that it is not worthwhile to use molecular methods to resolve the phase ambiguity of individual genotype data.
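To make the phase ambiguity concrete, the sketch below enumerates the haplotype configurations compatible with a set of per-locus allele counts, first for one individual (two haplotypes) and then for a pool of two individuals (four haplotypes). It assumes biallelic loci and that the observed data consist only of the count of the allele coded 1 at each locus. It is a counting illustration, not the efficiency calculation described above, and all names are hypothetical.

```python
import itertools

def compatible_configurations(allele_counts, n_haplotypes):
    """List unordered haplotype multisets whose per-locus sums match the data.

    allele_counts[j] is the observed number of copies of allele 1 at locus j
    among the n_haplotypes haplotypes in the individual or pool.
    """
    n_loci = len(allele_counts)
    haplotypes = list(itertools.product((0, 1), repeat=n_loci))
    target = tuple(allele_counts)
    matches = []
    # Brute-force enumeration: fine for a few loci, exponential in general.
    for config in itertools.combinations_with_replacement(haplotypes, n_haplotypes):
        sums = tuple(sum(h[j] for h in config) for j in range(n_loci))
        if sums == target:
            matches.append(config)
    return matches

# One individual, heterozygous at all three loci: 2 haplotypes.
individual = compatible_configurations((1, 1, 1), n_haplotypes=2)

# A pool of two individuals with two copies of allele 1 at each locus.
pool = compatible_configurations((2, 2, 2), n_haplotypes=4)

n_individual = len(individual)
n_pool = len(pool)
```

For an individual heterozygous at m loci there are 2^(m-1) compatible haplotype pairs, and the pooled observation admits more configurations still, which is one way to see why the information loss from pooling can grow with the number of loci.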
The R code for the calculations is available at http://www.stat.nus.edu.sg/~staxj/pooling