Division of Biostatistics and Epidemiology, University of Cincinnati, Cincinnati, OH, 45229, USA.
Department of Pediatrics, Cincinnati Children's Hospital Medical Center, University of Cincinnati, Cincinnati, OH, 45229, USA.
Sci Rep. 2019 Jul 31;9(1):11103. doi: 10.1038/s41598-019-47012-y.
Next-generation sequencing technologies now make it possible to sequence and genotype hundreds of thousands of genetic markers across the human genome. Selecting informative markers from high-dimensional genomic datasets to characterize an individual's genomic makeup has become common practice in evolutionary biology and human genetics. Although several feature selection approaches exist for determining ancestry proportions in two-way admixed populations, such as African Americans, few statistical tools have been developed for feature selection in three-way admixed populations, such as Latino populations. Herein, we present a new likelihood-based feature selection method, the Lancaster Estimator of Independence (LEI), which uses allele frequency information to prioritize the features most informative for determining ancestry proportions from multiple ancestral populations in admixed individuals. Because LEI relies on summary-level statistics from allele frequency data, it avoids many of the restrictions (and big-data issues) that can accompany access to individual-level genotype data, which makes it appealing for reducing the computation and time required for ancestry inference in an admixed population. We compared our allele frequency-based approach with a genotype-based approach for estimating admixture proportions in three-way admixed population scenarios. Ancestry estimates obtained using the top-ranked features from LEI were comparable to those obtained using features from genotype-based methods in three-way admixed populations. We provide easy-to-use R code to help researchers use LEI to develop allele frequency-based informative features for admixture mapping studies of admixed samples with multiple ancestral origins.
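To make the idea of ranking markers from allele frequencies alone more concrete, the following is a minimal Python sketch. It does not implement LEI, whose statistic is not given in this abstract. As a stand-in, it scores each biallelic marker by the mutual information between allele and ancestral population under equal population weights, a measure sometimes called informativeness for assignment. The marker names, the frequency values, and the function names `informativeness` and `rank_markers` are all hypothetical and chosen only for illustration.

```python
import math


def _xlogx(p):
    """Return p * ln(p), using the convention 0 * ln(0) = 0."""
    return p * math.log(p) if p > 0.0 else 0.0


def informativeness(freqs):
    """Score one biallelic marker from per-population allele frequencies.

    `freqs` holds the frequency of one allele in each ancestral population.
    The score is the mutual information between allele and population label
    with equal population weights. This is a stand-in measure, not LEI.
    """
    k = len(freqs)
    total = 0.0
    # Sum over both alleles: the given allele and its complement.
    for allele_freqs in (list(freqs), [1.0 - p for p in freqs]):
        mean = sum(allele_freqs) / k
        total += -_xlogx(mean) + sum(_xlogx(p) for p in allele_freqs) / k
    return total


def rank_markers(freq_table, top_n):
    """Return the names of the `top_n` highest-scoring markers."""
    ranked = sorted(
        freq_table,
        key=lambda name: informativeness(freq_table[name]),
        reverse=True,
    )
    return ranked[:top_n]


# Hypothetical allele frequencies in three ancestral populations.
toy_frequencies = {
    "snp_a": (0.95, 0.05, 0.50),  # differs strongly across populations
    "snp_b": (0.30, 0.35, 0.32),  # nearly identical across populations
    "snp_c": (0.90, 0.10, 0.10),  # separates one population from the others
    "snp_d": (0.50, 0.50, 0.50),  # carries no ancestry information
}

top_markers = rank_markers(toy_frequencies, top_n=2)
print(top_markers)
```

Only population-level frequencies enter the calculation, so no individual-level genotype data are needed. Markers whose frequencies are nearly identical across populations score close to zero and fall to the bottom of the ranking.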