Institute of Zoology, Zoological Society of London, London, NW1 4RY, UK.
Heredity (Edinb). 2022 Aug;129(2):79-92. doi: 10.1038/s41437-022-00535-z. Epub 2022 May 4.
Model-based (likelihood and Bayesian) and non-model-based (PCA and K-means clustering) methods have been developed to identify populations and assign individuals to the identified populations using marker genotype data. Model-based methods are favoured because they are based on a probabilistic model of population genetics with biologically meaningful parameters and thus produce results that are easily interpretable and applicable. Furthermore, they often yield more accurate structure inferences than non-model-based methods. However, current model-based methods are either computationally demanding, and thus applicable to small problems only, or rely on simplified admixture models that could yield inaccurate results in difficult situations such as unbalanced sampling. In this study, I propose new likelihood methods for fast and accurate population admixture inference using genotype data ranging from a few multiallelic microsatellites to millions of diallelic SNPs. The methods first conduct a clustering analysis of coarse-grained population structure using a mixture model and a simulated annealing algorithm, and then conduct an admixture analysis of fine-grained population structure, using the clustering results as the starting point of an expectation maximisation algorithm. Extensive analyses of both simulated and empirical data show that the new methods compare favourably with existing methods in both accuracy and running speed. They can analyse small datasets with just a few multiallelic microsatellites, and can also handle, in parallel, terabytes of data with millions of markers and millions of individuals. In difficult situations, such as many and/or weakly differentiated populations or unbalanced or very small samples of individuals, the new methods are substantially more accurate than other methods.
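The second stage described in the abstract, an expectation maximisation (EM) algorithm started from clustering results, can be illustrated with a small sketch. The code below is not the author's implementation: it is a generic textbook EM for the standard admixture model on diallelic SNPs, assuming genotypes coded as counts (0, 1 or 2) of a reference allele and no missing data. The function names (`init_from_clusters`, `em_step`, `fit_admixture`), the `eps` smoothing, the pseudocount and the clamping bound are all assumptions made for illustration. The hard cluster labels stand in for the output of the first (clustering) stage, which is not reproduced here.

```python
import math

P_MIN = 1e-6  # assumed bound that keeps allele frequencies away from 0 and 1


def init_from_clusters(genotypes, labels, k, eps=0.05):
    """Build starting values from a hard clustering (labels[i] in 0..k-1, k >= 2)."""
    n, n_loci = len(genotypes), len(genotypes[0])
    # Q: most ancestry on the assigned cluster, the remainder spread evenly.
    q = [[1.0 - eps if c == labels[i] else eps / (k - 1) for c in range(k)]
         for i in range(n)]
    # P: per-cluster reference-allele frequency, smoothed with a pseudocount.
    p = []
    for c in range(k):
        members = [genotypes[i] for i in range(n) if labels[i] == c]
        p.append([(sum(g[l] for g in members) + 1.0) / (2.0 * len(members) + 2.0)
                  for l in range(n_loci)])
    return q, p


def em_step(genotypes, q, p):
    """One EM update; returns new Q, new P and the log-likelihood of the inputs."""
    n, n_loci, k = len(genotypes), len(genotypes[0]), len(p)
    ref = [[0.0] * n_loci for _ in range(k)]  # expected reference copies per source
    alt = [[0.0] * n_loci for _ in range(k)]  # expected alternative copies per source
    new_q = [[0.0] * k for _ in range(n)]
    loglik = 0.0  # binomial coefficients are constant and omitted
    for i, row in enumerate(genotypes):
        for l, g in enumerate(row):
            # Probability that one allele copy of individual i is the reference allele.
            f = sum(q[i][c] * p[c][l] for c in range(k))
            loglik += g * math.log(f) + (2 - g) * math.log(1.0 - f)
            for c in range(k):
                # E-step: expected number of copies that came from population c.
                a = g * q[i][c] * p[c][l] / f
                b = (2 - g) * q[i][c] * (1.0 - p[c][l]) / (1.0 - f)
                ref[c][l] += a
                alt[c][l] += b
                new_q[i][c] += (a + b) / (2.0 * n_loci)
    # M-step for P, clamped so that later logarithms stay finite.
    new_p = [[min(1.0 - P_MIN, max(P_MIN, ref[c][l] / (ref[c][l] + alt[c][l])))
              for l in range(n_loci)] for c in range(k)]
    return new_q, new_p, loglik


def fit_admixture(genotypes, labels, k, max_iter=200, tol=1e-6):
    """Iterate EM from the clustering-based start until the log-likelihood settles."""
    q, p = init_from_clusters(genotypes, labels, k)
    history = []
    for _ in range(max_iter):
        q, p, loglik = em_step(genotypes, q, p)
        converged = bool(history) and abs(loglik - history[-1]) < tol
        history.append(loglik)
        if converged:
            break
    return q, p, history


if __name__ == "__main__":
    # Toy data: five individuals, four SNPs, two clusters (all values hypothetical).
    toy = [[2, 2, 0, 0], [2, 1, 0, 0], [0, 0, 2, 2], [0, 0, 1, 2], [1, 1, 1, 1]]
    q_hat, p_hat, trace = fit_admixture(toy, labels=[0, 0, 1, 1, 0], k=2)
    for row in q_hat:
        print([round(x, 3) for x in row])
```

Seeding EM with cluster-derived values, rather than random ones, is the design point the abstract highlights. EM only climbs to a local optimum, so a good starting point matters for both speed and accuracy. This sketch makes no attempt at the scale or parallelism that the abstract reports.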