Peña-Malavera Andrea, Bruno Cecilia, Fernandez Elmer, Balzarini Monica
Stat Appl Genet Mol Biol. 2014 Aug;13(4):391-402. doi: 10.1515/sagmb-2013-0006.
Identifying population genetic structure (PGS) is crucial for breeding and conservation. Several clustering algorithms are available for identifying the underlying PGS from genetic data of maize genotypes. In this work, six methods for identifying PGS from unlinked molecular marker data were compared using simulated and experimental data consisting of multilocus biallelic genotypes. Datasets were delineated under different biological scenarios characterized by three levels of genetic divergence among populations (low, medium, and high FST) and two numbers of sub-populations (K=3 and K=5). The relative performance of hierarchical and non-hierarchical clustering, model-based clustering (STRUCTURE), and neural-network clustering (SOM-RP-Q) was compared. The clustering error rate of genotypes into discrete sub-populations was used as the comparison criterion. In scenarios with a high level of divergence among genotype groups, all methods performed well. With a moderate level of genetic divergence (FST=0.2), the SOM-RP-Q and STRUCTURE algorithms performed better than hierarchical and non-hierarchical clustering. In all simulated scenarios with low genetic divergence and in the experimental maize SNP panel (largely unlinked), SOM-RP-Q achieved the lowest clustering error rate. The SOM algorithm used here is more effective than the other evaluated methods for sparse unlinked genetic data.
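As an illustration of the comparison criterion, the sketch below computes a clustering error rate for genotypes assigned to discrete sub-populations. The abstract does not say how cluster labels were matched to the true sub-populations, so the best one-to-one matching used here is an assumption, and the function name `clustering_error_rate` is hypothetical. The exhaustive search over matchings is only practical for a small number of groups, such as the K=3 and K=5 scenarios.

```python
from itertools import permutations

_PAD = object()  # placeholder used when one side has fewer groups than the other


def clustering_error_rate(true_labels, cluster_labels):
    """Fraction of genotypes misassigned under the best one-to-one matching
    between clusters and true sub-populations (assumed matching rule)."""
    if len(true_labels) != len(cluster_labels):
        raise ValueError("label sequences must have the same length")
    if len(true_labels) == 0:
        raise ValueError("label sequences must not be empty")

    pops = list(set(true_labels))
    clusters = list(set(cluster_labels))

    # Pad the shorter list so every cluster and population takes part in a matching.
    size = max(len(pops), len(clusters))
    pops += [_PAD] * (size - len(pops))
    clusters += [_PAD] * (size - len(clusters))

    # Contingency counts: how many genotypes of population t fell into cluster c.
    counts = {}
    for t, c in zip(true_labels, cluster_labels):
        counts[(t, c)] = counts.get((t, c), 0) + 1

    # Try every assignment of populations to clusters and keep the best agreement.
    best_agreement = 0
    for perm in permutations(pops):
        agreement = sum(counts.get((p, c), 0) for p, c in zip(perm, clusters))
        best_agreement = max(best_agreement, agreement)

    return 1 - best_agreement / len(true_labels)


# Toy example with K=3: one genotype from population 1 lands in the wrong cluster.
true_pops = [0, 0, 0, 1, 1, 1, 2, 2, 2]
assigned = [2, 2, 2, 0, 0, 1, 1, 1, 1]
error = clustering_error_rate(true_pops, assigned)
```

Because cluster labels are arbitrary, the matching step keeps a pure relabeling of a perfect partition from being counted as an error.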