University of Freiburg, Department of Mathematical Stochastics, Ernst-Zermelo-Straße 1, D-79104 Freiburg, Germany.
University of Freiburg, Faculty of Medicine and Medical Center, Institute of Genetic Epidemiology, Germany.
Forensic Sci Int Genet. 2020 May;46:102259. doi: 10.1016/j.fsigen.2020.102259. Epub 2020 Feb 15.
Inference of the Biogeographical Ancestry (BGA) of a person or trace relies on three ingredients: (1) a reference database of DNA samples that includes BGA information; (2) a statistical clustering method; and (3) a set of loci that segregate according to geographical location, i.e. a set of so-called Ancestry Informative Markers (AIMs). We used the theory of feature selection from statistical learning to obtain AIMsets for BGA inference. Using simulations, we show that this learning procedure works in various cases and outperforms ad hoc methods that choose AIMs based on statistics such as F_ST or informativeness. Applying our method to data from the 1000 Genomes Project (excluding Admixed Americans), we identified an AIMset of 12 SNPs that gives a vanishing misclassification error on a continental scale, as do other published AIMsets. In fact, cross-validation shows that many sets perform comparably to the optimal AIMset. On a sub-continental scale, we find a set of 55 SNPs for distinguishing the five European populations. The misclassification error is reduced by a factor of two relative to published AIMsets, but it is still 30% and therefore too large to be useful in forensic applications.
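The idea of choosing AIMs by feature selection can be illustrated with a small sketch. It is not the procedure used in the study: it assumes a greedy forward selection wrapped around a naive Bayes classifier with binomial genotype likelihoods, and the allele frequencies, sample sizes and function names (`simulate`, `fit_freqs`, `greedy_select`, `run_demo`) are hypothetical choices made only for illustration.

```python
import math
import random


def simulate(freqs, n_per_pop, rng):
    """Draw diploid genotypes (0, 1 or 2 allele copies) for each population."""
    X, y = [], []
    for pop, pop_freqs in enumerate(freqs):
        for _ in range(n_per_pop):
            X.append([(rng.random() < p) + (rng.random() < p) for p in pop_freqs])
            y.append(pop)
    return X, y


def fit_freqs(X, y, n_pops, n_loci):
    """Estimate per-population allele frequencies; pseudo-counts avoid log(0)."""
    counts = [[0.5] * n_loci for _ in range(n_pops)]
    totals = [1.0] * n_pops
    for row, pop in zip(X, y):
        totals[pop] += 2
        for j, g in enumerate(row):
            counts[pop][j] += g
    return [[c / totals[k] for c in counts[k]] for k in range(n_pops)]


def predict(row, est, loci):
    """Naive Bayes with equal priors: maximise the binomial log-likelihood."""
    def loglik(p):
        return sum(row[j] * math.log(p[j]) + (2 - row[j]) * math.log(1 - p[j])
                   for j in loci)
    return max(range(len(est)), key=lambda k: loglik(est[k]))


def error_rate(X, y, est, loci):
    """Fraction of samples assigned to the wrong population."""
    wrong = sum(predict(row, est, loci) != pop for row, pop in zip(X, y))
    return wrong / len(y)


def greedy_select(X_tr, y_tr, X_va, y_va, n_pops, n_select):
    """Forward selection: repeatedly add the locus that minimises validation error."""
    n_loci = len(X_tr[0])
    est = fit_freqs(X_tr, y_tr, n_pops, n_loci)
    chosen = []
    for _ in range(n_select):
        candidates = [j for j in range(n_loci) if j not in chosen]
        best = min(candidates,
                   key=lambda j: error_rate(X_va, y_va, est, chosen + [j]))
        chosen.append(best)
    return chosen, error_rate(X_va, y_va, est, chosen)


def run_demo(seed=0, n_pops=3, n_loci=30, n_informative=6, n_per_pop=200):
    rng = random.Random(seed)
    # Hypothetical frequencies: informative locus j is common (0.95) only in
    # population j % n_pops and rare (0.05) elsewhere; the rest are 0.5 everywhere.
    freqs = [[(0.95 if j % n_pops == k else 0.05) if j < n_informative else 0.5
              for j in range(n_loci)] for k in range(n_pops)]
    X_tr, y_tr = simulate(freqs, n_per_pop, rng)
    X_va, y_va = simulate(freqs, n_per_pop, rng)
    return greedy_select(X_tr, y_tr, X_va, y_va, n_pops, n_select=3)


chosen, err = run_demo()
print("selected loci:", chosen, "validation error:", err)
```

In this toy setting the selected loci should come from the six simulated informative ones. Because the same validation set is used both to pick loci and to report the error, the reported value is optimistic; an honest estimate would need a separate held-out set or nested cross-validation.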