Institute of Statistical Science, Academia Sinica, 128 Academia Road, Section 2, Taipei, Taiwan.
Sci Rep. 2021 Sep 7;11(1):17741. doi: 10.1038/s41598-021-97129-2.
Principal Component Analysis (PCA) projects high-dimensional genotype data onto a few components that separate populations. Ancestry Informative Markers (AIMs) are a small subset of SNPs capable of distinguishing populations. We integrate these two approaches by proposing an algorithm to identify necessary informative loci, whose removal from the data deteriorates the PCA structure. Unlike classical AIMs, necessary informative loci densely cover the genome and hence can illuminate the evolution and mixing history of populations. We conduct a comprehensive analysis of the genotype data of the 1000 Genomes Project using necessary informative loci. Projections along the top seven principal components (PCs) demarcate populations at distinct geographic levels. Millions of necessary informative loci are identified along each PC. Population identities along each PC are approximately determined by weighted sums of minor (or major) alleles over the informative loci. Variations in allele frequencies are aligned with the history and direction of population evolution. The population distribution of projections along the top three PCs is recapitulated by a simple demographic model based on several waves of founder population separation and mixing. Informative loci exhibit locational concentration in the genome and functional enrichment. Genes at two hot spots encompassing dense PC 7 informative loci exhibit differential expression among European populations. The mosaic of local ancestry in the genome of a mixed descendant of multiple populations can be inferred from partial PCA projections of informative loci. Finally, informative loci derived from the 1000 Genomes data accurately predict the projections of an independent genotype dataset of South Asians. These results demonstrate the utility and relevance of informative loci for investigating human evolution.
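The statement that population identity along a PC is approximately a weighted sum over informative loci can be illustrated with a small simulation. The sketch below is not the authors' algorithm: it assumes simulated genotypes for two populations and, as a simple stand-in, ranks loci by the absolute value of their PC 1 loading. The function names (`simulate_genotypes`, `top_pc`, `partial_projection`) and all parameter values are hypothetical choices made for illustration.

```python
import numpy as np


def simulate_genotypes(n_per_pop=50, n_loci=500, n_diff=50, seed=0):
    """Simulate 0/1/2 genotypes for two populations (illustrative assumption).

    The first n_diff loci have different allele frequencies in the two
    populations; the remaining loci share the same frequency.
    """
    rng = np.random.default_rng(seed)
    freq_a = rng.uniform(0.2, 0.8, size=n_loci)
    freq_b = freq_a.copy()
    freq_a[:n_diff] = 0.2  # differentiated loci, population A
    freq_b[:n_diff] = 0.8  # differentiated loci, population B
    pop_a = rng.binomial(2, freq_a, size=(n_per_pop, n_loci))
    pop_b = rng.binomial(2, freq_b, size=(n_per_pop, n_loci))
    genotypes = np.vstack([pop_a, pop_b]).astype(float)
    labels = np.array([0] * n_per_pop + [1] * n_per_pop)
    return genotypes, labels


def top_pc(genotypes):
    """Return column-centered genotypes, PC 1 loadings, and PC 1 projections."""
    centered = genotypes - genotypes.mean(axis=0)
    # Rows of vt are the principal axes (one loading per locus).
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    loadings = vt[0]
    return centered, loadings, centered @ loadings


def partial_projection(centered, loadings, k):
    """Project individuals using only the k loci with the largest |loading|."""
    idx = np.argsort(np.abs(loadings))[::-1][:k]
    return idx, centered[:, idx] @ loadings[idx]


if __name__ == "__main__":
    genotypes, labels = simulate_genotypes()
    centered, loadings, full = top_pc(genotypes)
    idx, partial = partial_projection(centered, loadings, k=50)
    corr = np.corrcoef(full, partial)[0, 1]
    print("correlation between full and partial PC 1 projections:", corr)
    print("selected loci that are truly differentiated:", int(np.sum(idx < 50)))
```

In this toy setting the partial projection is a weighted sum of centered allele counts over the selected loci, with the PC loadings as weights. Ranking by loading magnitude is only a proxy; the paper's criterion is that removing the loci deteriorates the PCA structure, which this sketch does not test.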