Population identification using genetic data.

Affiliation

Heilbronn Institute for Mathematical Research, School of Mathematics, University of Bristol, Bristol BS8 1TW, UK.

Publication information

Annu Rev Genomics Hum Genet. 2012;13:337-61. doi: 10.1146/annurev-genom-082410-101510. Epub 2012 Jun 11.

Abstract

A large number of algorithms have been developed to classify individuals into discrete populations using genetic data. Recent results show that the information used by both model-based clustering methods and principal components analysis can be summarized by a matrix of pairwise similarity measures between individuals. Similarity matrices have been constructed in a number of ways, usually treating markers as independent but differing in the weighting given to polymorphisms of different frequencies. Additionally, methods are now being developed that take linkage into account. We review several such matrices and evaluate their information content. A two-stage approach for population identification is to first construct a similarity matrix and then perform clustering. We review a range of common clustering algorithms and evaluate their performance through a simulation study. The clustering step can be performed either on the matrix or by first using a dimension-reduction technique; we find that the latter approach substantially improves the performance of most algorithms. Based on these results, we describe the population structure signal contained in each similarity matrix and find that accounting for linkage leads to significant improvements for sequence data. We also perform a comparison on real data, where we find that population genetics models outperform generic clustering approaches, particularly with regard to robustness for features such as relatedness between individuals.
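
The two-stage approach described in the abstract (build a pairwise similarity matrix, reduce its dimension, then cluster) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the review: the genotypes are simulated under a Balding-Nichols model with unlinked markers, the similarity matrix is the frequency-normalized covariance commonly used for principal components analysis, the number of retained eigenvectors is set to the number of clusters minus one, and the clustering step is a plain k-means written in NumPy. All function names are hypothetical.

```python
import numpy as np


def simulate_genotypes(n_per_pop=50, n_snps=2000, fst=0.05, seed=0):
    """Simulate unlinked 0/1/2 genotypes for two populations.

    Assumption: allele frequencies follow a Balding-Nichols model, i.e.
    each population's frequency is Beta-distributed around a shared
    ancestral frequency with divergence `fst`.
    """
    rng = np.random.default_rng(seed)
    ancestral = rng.uniform(0.1, 0.9, size=n_snps)
    a = ancestral * (1 - fst) / fst
    b = (1 - ancestral) * (1 - fst) / fst
    freqs = rng.beta(a, b, size=(2, n_snps))    # one row per population
    labels = np.repeat([0, 1], n_per_pop)       # true population of each individual
    genotypes = rng.binomial(2, freqs[labels])  # individuals x markers
    return genotypes, labels


def similarity_matrix(genotypes):
    """Stage 1: pairwise similarity between individuals.

    Markers are treated as independent and each is scaled by
    sqrt(2p(1-p)), which up-weights low-frequency polymorphisms.
    """
    p = genotypes.mean(axis=0) / 2.0
    keep = (p > 0) & (p < 1)                    # drop monomorphic markers
    g, p = genotypes[:, keep], p[keep]
    z = (g - 2 * p) / np.sqrt(2 * p * (1 - p))
    return z @ z.T / z.shape[1]


def top_eigenvectors(sim, k):
    """Dimension reduction: coordinates on the k leading eigenvectors."""
    vals, vecs = np.linalg.eigh(sim)            # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:k]
    return vecs[:, order] * np.sqrt(np.maximum(vals[order], 0))


def kmeans(points, k, n_iter=50, seed=0):
    """Stage 2: plain Lloyd's k-means (a generic clustering algorithm)."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        for j in range(k):
            if np.any(assign == j):             # leave empty clusters unchanged
                centers[j] = points[assign == j].mean(axis=0)
    return assign


if __name__ == "__main__":
    genotypes, labels = simulate_genotypes()
    sim = similarity_matrix(genotypes)
    coords = top_eigenvectors(sim, k=1)         # K - 1 components for K = 2
    assign = kmeans(coords, k=2)
    # Cluster labels are arbitrary, so score agreement up to label swapping.
    agreement = max(np.mean(assign == labels), np.mean(assign != labels))
    print(f"agreement with simulated populations: {agreement:.2f}")
```

Replacing `coords` with the rows of `sim` itself in the call to `kmeans` gives the "cluster directly on the matrix" variant, which makes it easy to compare the two routes on the same simulated data. A linkage-aware similarity matrix would need a different construction from the one sketched here.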
