The Bioinformatics Centre, Department of Biology, University of Copenhagen, DK-2200, Denmark
Genetics. 2018 Oct;210(2):719-731. doi: 10.1534/genetics.118.301336. Epub 2018 Aug 21.
We present two methods for inferring population structure and admixture proportions from low-depth next-generation sequencing (NGS) data. Inference of population structure is essential in both population genetics and association studies, and is often performed using principal component analysis (PCA) or clustering-based approaches. NGS methods provide large amounts of genetic data, but the data carry statistical uncertainty, especially at low sequencing depth. Models can account for this uncertainty by working directly on the genotype likelihoods of the unobserved genotypes. We propose a method for inferring population structure through PCA, using an iterative heuristic that estimates individual allele frequencies, and we demonstrate improved accuracy in samples with low and variable sequencing depth for both simulated and real datasets. We also use the estimated individual allele frequencies in a fast non-negative matrix factorization method to estimate admixture proportions. Both methods are implemented in the PCAngsd framework, available at http://www.popgen.dk/software/.
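To make the idea of working directly on genotype likelihoods concrete, the following is a minimal sketch and not the PCAngsd implementation. It assumes biallelic sites, a Hardy-Weinberg prior, and a single population-wide allele frequency per site, whereas the method described above uses individual allele frequencies. The function names (`genotype_posteriors`, `expected_genotype`, `em_allele_frequency`) are hypothetical and chosen only for illustration.

```python
def genotype_posteriors(likelihoods, freq):
    """Posterior P(G = 0, 1, 2 | data) for one individual at one site.

    likelihoods: (L0, L1, L2), the genotype likelihoods P(data | G = g).
    freq: allele frequency used in the Hardy-Weinberg prior (assumption:
          one shared frequency per site, strictly between 0 and 1).
    """
    prior = ((1 - freq) ** 2, 2 * freq * (1 - freq), freq ** 2)
    joint = [lik * p for lik, p in zip(likelihoods, prior)]
    total = sum(joint)
    return [j / total for j in joint]


def expected_genotype(likelihoods, freq):
    """Posterior mean genotype (dosage), a value between 0 and 2."""
    post = genotype_posteriors(likelihoods, freq)
    return post[1] + 2 * post[2]


def em_allele_frequency(site_likelihoods, start=0.25, iterations=100, tol=1e-8):
    """EM estimate of the allele frequency at one site.

    site_likelihoods: one (L0, L1, L2) tuple per individual.
    Each iteration computes expected genotypes under the current
    frequency, then updates the frequency to half their mean.
    """
    freq = start
    for _ in range(iterations):
        dosages = [expected_genotype(gl, freq) for gl in site_likelihoods]
        new_freq = sum(dosages) / (2 * len(dosages))
        if abs(new_freq - freq) < tol:
            return new_freq
        freq = new_freq
    return freq
```

With uninformative likelihoods, such as an individual with no reads at a site, the expected genotype falls back to the prior mean of twice the allele frequency. This is why the choice of frequency matters at low depth, and it is the motivation for replacing the population-wide frequency with individual allele frequencies when samples come from a structured population.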