The Wallace H. Coulter Department of Biomedical Engineering, Georgia Institute of Technology and Emory University, Atlanta, GA 30332, USA.
BMC Bioinformatics. 2013 Jan 23;14:28. doi: 10.1186/1471-2105-14-28.
Population inference is an important problem in genetics: it is used to remove population stratification in genome-wide association studies and to detect migration patterns or shared ancestry. An individual's genotype can be modeled as a probabilistic function of ancestral population memberships, Q, and the allele frequencies in those populations, P. The parameters P and Q of this binomial likelihood model can be inferred using slow sampling methods such as Markov chain Monte Carlo or faster gradient-based approaches such as sequential quadratic programming. This paper proposes a least-squares simplification of the binomial likelihood model, motivated by a Euclidean interpretation of the genotype feature space. The result is a faster algorithm that easily incorporates the degree of admixture within the sample of individuals and improves estimates without requiring trial-and-error tuning.
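The model described above can be illustrated with a minimal Python sketch. It assumes genotypes are coded as 0, 1, or 2 copies of an allele, that P is an M x K matrix of allele frequencies and Q a K x N matrix whose columns sum to one, and that each genotype is drawn from Binomial(2, (PQ)_ij). The function names, the alternating update order, and the clip-and-project handling of the constraints are illustrative assumptions, not the algorithm published in the paper.

```python
import numpy as np

def simulate_genotypes(P, Q, rng):
    """Draw genotypes g_ij ~ Binomial(2, (P Q)_ij) with values in {0, 1, 2}."""
    freq = np.clip(P @ Q, 0.0, 1.0)  # guard against floating-point overshoot
    return rng.binomial(2, freq)

def project_to_simplex(v):
    """Euclidean projection of a vector onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    k = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / k > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1), 0.0)

def alternating_least_squares(G, K, n_iter=50, seed=0):
    """Heuristic fit of G / 2 ~ P Q by alternating least-squares solves.

    Assumption: each unconstrained solve is followed by a simple projection
    (clip P to [0, 1], project columns of Q onto the simplex). This is a
    sketch, not an exact constrained least-squares solver.
    """
    rng = np.random.default_rng(seed)
    M, N = G.shape
    X = G / 2.0                                  # scale genotypes to [0, 1]
    Q = rng.dirichlet(np.ones(K), size=N).T      # random start, shape K x N
    for _ in range(n_iter):
        # Fix Q, solve min_P ||X - P Q||_F^2, then clip to valid frequencies.
        P = np.linalg.lstsq(Q.T, X.T, rcond=None)[0].T
        P = np.clip(P, 0.0, 1.0)
        # Fix P, solve min_Q ||X - P Q||_F^2, then project each column.
        Q_free = np.linalg.lstsq(P, X, rcond=None)[0]
        Q = np.apply_along_axis(project_to_simplex, 0, Q_free)
    return P, Q
```

For example, one could draw `P_true` uniformly, draw `Q_true` from a Dirichlet distribution, call `simulate_genotypes`, and pass the result to `alternating_least_squares` with the same K. The recovered populations may come back in a permuted order relative to the truth.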
We show that the expected value of the least-squares solution across all possible genotype datasets is equal to the true solution when part of the problem has been solved, and that the variance of the solution approaches zero as its size increases. The least-squares algorithm performs nearly as well as Admixture for these theoretical scenarios. We compare least-squares, Admixture, and FRAPPE for a variety of problem sizes and difficulties. For particularly hard problems with a large number of populations, a small number of samples, or a greater degree of admixture, least-squares performs better than the other methods. On simulated mixtures of real population allele frequencies from the HapMap project, Admixture estimates sparsely mixed individuals better than least-squares. The least-squares approach, however, performs within 1.5% of the Admixture error. On individual genotypes from the HapMap project, Admixture and least-squares perform qualitatively similarly and within 1.2% of each other. Significantly, the least-squares approach nearly always converges 1.5 to 6 times faster.
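The unbiasedness and variance statements can be sketched for one individual under simplifying assumptions that are not taken from the paper: P is treated as known ("part of the problem has been solved"), the constraints on q are ignored, markers are independent, and "size" is read as the number of markers M. The symbols g, x, and \hat{q} are introduced here for illustration only.

```latex
% Genotype vector g \in \{0,1,2\}^M with g_i \sim \mathrm{Binomial}(2, (Pq)_i); let x = g/2.
\mathbb{E}[x] = Pq, \qquad
\hat{q} = (P^{\top}P)^{-1}P^{\top}x
\;\Longrightarrow\;
\mathbb{E}[\hat{q}] = (P^{\top}P)^{-1}P^{\top}Pq = q .

% Each x_i has variance (Pq)_i\,(1-(Pq)_i)/2 \le 1/8, so with
% \Sigma = \mathrm{diag}\!\big((Pq)_i(1-(Pq)_i)/2\big):
\mathrm{Cov}(\hat{q}) = (P^{\top}P)^{-1}P^{\top}\Sigma P\,(P^{\top}P)^{-1}
\;\preceq\; \tfrac{1}{8}\,(P^{\top}P)^{-1} .
```

Under these assumptions the bound shrinks toward zero whenever the smallest eigenvalue of P^T P grows with M, which is one way to read the claim that the variance vanishes as the problem size increases.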
The computational advantage of the least-squares approach along with its good estimation performance warrants further research, especially for very large datasets. As problem sizes increase, the difference in estimation performance between all algorithms decreases. In addition, when prior information is known, the least-squares approach easily incorporates the expected degree of admixture to improve the estimate.