A fast least-squares algorithm for population inference.

The Wallace H. Coulter Department of Biomedical Engineering, Georgia Institute of Technology and Emory University, Atlanta, GA 30332, USA.

BMC Bioinformatics. 2013 Jan 23;14:28. doi: 10.1186/1471-2105-14-28.

BACKGROUND

Population inference is an important problem in genetics, used to remove population stratification in genome-wide association studies and to detect migration patterns or shared ancestry. An individual's genotype can be modeled as a probabilistic function of ancestral population memberships, Q, and the allele frequencies in those populations, P. The parameters P and Q of this binomial likelihood model can be inferred using slow sampling methods such as Markov chain Monte Carlo, or faster gradient-based approaches such as sequential quadratic programming. This paper proposes a least-squares simplification of the binomial likelihood model, motivated by a Euclidean interpretation of the genotype feature space. The result is a faster algorithm that easily incorporates the degree of admixture within the sample of individuals and improves estimates without requiring trial-and-error tuning.
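
To make the model concrete, the binomial likelihood and one plausible least-squares relaxation can be written out as follows. The index conventions (N individuals, M markers, K populations), the scaling of genotypes by 1/2, and the exact form of the constraints are assumptions made here for illustration; the paper's own objective may differ in detail.

```latex
% Assumed notation (not taken from the abstract):
%   g_{ij} \in \{0,1,2\} : allele count of individual i at marker j
%   q_{ik}               : fraction of individual i's ancestry from population k
%   p_{kj}               : allele frequency of marker j in population k
\begin{align}
g_{ij} &\sim \mathrm{Binomial}\Big(2,\ \sum_{k=1}^{K} q_{ik}\, p_{kj}\Big), \\
\mathbb{E}\big[\tfrac{1}{2} g_{ij}\big] &= \sum_{k=1}^{K} q_{ik}\, p_{kj}, \\
(\hat{Q}, \hat{P}) &= \arg\min_{Q,\,P} \sum_{i=1}^{N} \sum_{j=1}^{M}
  \Big(\tfrac{1}{2} g_{ij} - \sum_{k=1}^{K} q_{ik}\, p_{kj}\Big)^{2} \\
&\quad \text{subject to } q_{ik} \ge 0,\quad \sum_{k=1}^{K} q_{ik} = 1,\quad 0 \le p_{kj} \le 1 .
\end{align}
```

The second line is what motivates the Euclidean view: the scaled genotype has expectation equal to the corresponding entry of the product QP, so squared distance between the scaled genotypes and QP is a natural surrogate for the likelihood.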

RESULTS

We show that the expected value of the least-squares solution across all possible genotype datasets is equal to the true solution when part of the problem has been solved, and that the variance of the solution approaches zero as its size increases. The least-squares algorithm performs nearly as well as Admixture for these theoretical scenarios. We compare least-squares, Admixture, and FRAPPE for a variety of problem sizes and difficulties. For particularly hard problems with a large number of populations, a small number of samples, or a greater degree of admixture, least-squares performs better than the other methods. On simulated mixtures of real population allele frequencies from the HapMap project, Admixture estimates sparsely mixed individuals better than least-squares. The least-squares approach, however, performs within 1.5% of the Admixture error. On individual genotypes from the HapMap project, Admixture and least-squares perform qualitatively similarly and within 1.2% of each other. Notably, the least-squares approach nearly always converges 1.5 to 6 times faster.
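
The unbiasedness and shrinking-variance claims can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the authors' experiment: it takes "part of the problem has been solved" to mean that the allele frequencies P are known, restricts to K = 2 populations so that the least-squares estimate of a single admixture proportion q has a closed form, and reads "size" as the number of markers. All function names and parameter values are hypothetical.

```python
import random


def simulate_genotypes(q, p1, p2, rng):
    """Draw one diploid genotype (0, 1 or 2 allele copies) per marker."""
    genotypes = []
    for a, b in zip(p1, p2):
        freq = q * a + (1.0 - q) * b  # individual-specific allele frequency
        genotypes.append(sum(rng.random() < freq for _ in range(2)))
    return genotypes


def least_squares_q(genotypes, p1, p2):
    """Closed-form least-squares estimate of q for K = 2 with P known.

    Minimises sum_j (g_j / 2 - (q * p1_j + (1 - q) * p2_j)) ** 2 over q.
    The result is left unconstrained; a full solver would also keep q
    inside [0, 1].
    """
    num = sum((g / 2.0 - b) * (a - b) for g, a, b in zip(genotypes, p1, p2))
    den = sum((a - b) ** 2 for a, b in zip(p1, p2))
    return num / den


def mean_and_variance(values):
    """Return the sample mean and the (population) variance of values."""
    mean = sum(values) / len(values)
    var = sum((x - mean) ** 2 for x in values) / len(values)
    return mean, var


def experiment(n_markers, n_datasets=200, q_true=0.3, seed=0):
    """Estimate q on many simulated datasets that share the same known P."""
    rng = random.Random(seed)
    # Hypothetical allele frequencies for the two ancestral populations.
    p1 = [rng.uniform(0.05, 0.95) for _ in range(n_markers)]
    p2 = [rng.uniform(0.05, 0.95) for _ in range(n_markers)]
    estimates = [
        least_squares_q(simulate_genotypes(q_true, p1, p2, rng), p1, p2)
        for _ in range(n_datasets)
    ]
    return mean_and_variance(estimates)


if __name__ == "__main__":
    for n_markers in (100, 2000):
        mean, var = experiment(n_markers)
        print(f"markers={n_markers}: mean={mean:.4f} variance={var:.6f}")
```

Under these assumptions, the mean of the estimates should sit close to the true value of 0.3 for both marker counts, while the variance should be markedly smaller with 2000 markers than with 100.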

CONCLUSIONS

The computational advantage of the least-squares approach, along with its good estimation performance, warrants further research, especially for very large datasets. As problem sizes increase, the difference in estimation performance among the algorithms decreases. In addition, when prior information is available, the least-squares approach easily incorporates the expected degree of admixture to improve the estimate.
