Department of Biology, Stanford University, Stanford, California, USA.
Rare Cancers Genomics Team (RCG), Genomic Epidemiology Branch (GEM), International Agency for Research on Cancer/World Health Organisation (IARC/WHO), Lyon, France.
Mol Ecol Resour. 2022 Oct;22(7):2614-2626. doi: 10.1111/1755-0998.13647. Epub 2022 Jul 20.
In model-based inference of population structure from individual-level genetic data, individuals are assigned membership coefficients in a series of statistical clusters generated by clustering algorithms. Different groups of individuals, representing for example different predefined populations, sampling sites or time periods, can show distinct patterns of variability in their membership coefficients. Such variability can be difficult to capture in a single numerical value: membership coefficient vectors are multivariate and potentially incommensurable across predefined groups, because the number of clusters over which individuals are distributed can vary among the groups of interest. Further, two groups might share few clusters, so that their membership coefficient vectors are concentrated on different clusters. We introduce a method for measuring the variability of membership coefficients of individuals in a predefined group, making use of an analogy between variability across individuals in membership coefficient vectors and variation across populations in allele frequency vectors. We show that in a model in which membership coefficient vectors in a population follow a Dirichlet distribution, the measure increases linearly with a parameter describing the variance of a specified component of the membership vector and does not depend on its mean. We apply the approach, which makes use of a normalized F_ST statistic, to data on inferred population structure in three example scenarios. We also introduce a bootstrap test for equivalence of two or more predefined groups in their level of membership coefficient variability. Our methods are implemented in the R package FSTruct.
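The analogy described above can be illustrated with a minimal sketch, assuming each individual's membership vector plays the role of one population's allele frequency vector. The code below computes only the plain (unnormalized) F_ST of a membership matrix and draws bootstrap replicates by resampling individuals. The function names `fst` and `bootstrap_fst` are hypothetical and are not the FSTruct API; the normalization step and the statistical comparison of bootstrap distributions between groups are not reproduced here.

```python
import random


def fst(q_matrix):
    """Unnormalized F_ST of a membership matrix.

    Rows are individuals, columns are clusters, and each row sums to 1.
    Individuals are treated like populations, and clusters like alleles.
    """
    n = len(q_matrix)
    k = len(q_matrix[0])
    # Mean membership in each cluster across individuals.
    means = [sum(row[j] for row in q_matrix) / n for j in range(k)]
    # Average over individuals of the sum of squared memberships.
    mean_sq = sum(sum(x * x for x in row) for row in q_matrix) / n
    # Sum of squared mean memberships.
    total_sq = sum(m * m for m in means)
    denom = 1.0 - total_sq
    if denom <= 1e-12:
        # Every individual sits entirely in the same cluster: no variability.
        return 0.0
    # (H_T - H_S) / H_T with H_T = 1 - total_sq and H_S = 1 - mean_sq.
    return (mean_sq - total_sq) / denom


def bootstrap_fst(q_matrix, n_reps=1000, rng=None):
    """Resample individuals (rows) with replacement and recompute F_ST."""
    rng = rng or random.Random()
    n = len(q_matrix)
    replicates = []
    for _ in range(n_reps):
        sample = [q_matrix[rng.randrange(n)] for _ in range(n)]
        replicates.append(fst(sample))
    return replicates


if __name__ == "__main__":
    # Illustrative toy data, not taken from the paper.
    group_a = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
    group_b = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.2], [0.2, 0.8]]
    rng = random.Random(0)
    for name, group in (("A", group_a), ("B", group_b)):
        reps = bootstrap_fst(group, n_reps=200, rng=rng)
        print(name, fst(group), sum(reps) / len(reps))
```

In this sketch the statistic is 0 when all individuals have identical membership vectors and 1 when every individual belongs entirely to one cluster and the cluster means are not concentrated on a single cluster. Resampling rows rather than individual coefficients keeps each membership vector intact, so every bootstrap replicate is itself a valid membership matrix.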