Ahn Jaeil, Conkright Brian, Boca Simina M, Madhavan Subha
1 Department of Biostatistics, Bioinformatics, and Biomathematics, Georgetown University, Washington, District of Columbia.
2 Innovation Center for Biomedical Informatics, Georgetown University, Washington, District of Columbia.
J Comput Biol. 2018 Apr;25(4):417-429. doi: 10.1089/cmb.2017.0127. Epub 2018 Jan 2.
Statistical approaches for population structure estimation have been driven predominantly by a single data type, single-nucleotide polymorphisms (SNPs). However, when SNPs are only weakly identifiable, population structure estimation can lose accuracy. Copy number variations (CNVs) are genomic structural variants whose loci are commonly shared within a specific population, and they therefore provide valuable information for estimating the ancestry of sampled populations. We develop a Bayesian framework for the joint modeling of SNPs and CNVs, called POPSTR, to characterize population structure better than approaches that use SNPs alone. To deal with the increased data volume, we use the Metropolis-adjusted Langevin algorithm (MALA), which uses gradient information from the target distribution to guide proposals in a computationally efficient way. We illustrate applications of our approach using the HapMap 2005 project data. We carry out simulation studies and show that the performance of our approach is comparable to or better than that of popular benchmarks, STRUCTURE and ADMIXTURE. We also observe that using only CNVs can be remarkably efficient if SNP data are not available.
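To make the MALA step concrete, the following is a minimal generic sketch of the algorithm on a toy target, not the POPSTR sampler itself. The function name `mala`, its parameters, and the two-dimensional standard normal target are all assumptions chosen for illustration; the abstract does not specify the posterior, the step size, or any implementation detail.

```python
import numpy as np


def mala(log_density, grad_log_density, x0, step_size, n_samples, seed=0):
    """Generic MALA sampler (illustrative; not the POPSTR implementation)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    samples = np.empty((n_samples, x.size))
    n_accepted = 0

    def log_q(x_to, x_from):
        # Log proposal density of x_to given x_from, up to an additive constant.
        mean = x_from + 0.5 * step_size * grad_log_density(x_from)
        return -np.sum((x_to - mean) ** 2) / (2.0 * step_size)

    for i in range(n_samples):
        # Langevin proposal: drift along the gradient plus Gaussian noise.
        noise = rng.standard_normal(x.size)
        proposal = (x + 0.5 * step_size * grad_log_density(x)
                    + np.sqrt(step_size) * noise)
        # Metropolis-Hastings correction for the asymmetric proposal.
        log_alpha = (log_density(proposal) + log_q(x, proposal)
                     - log_density(x) - log_q(proposal, x))
        if np.log(rng.uniform()) < log_alpha:
            x = proposal
            n_accepted += 1
        samples[i] = x
    return samples, n_accepted / n_samples


# Toy target (an assumption for this sketch): 2-D standard normal.
def log_target(x):
    return -0.5 * np.sum(x ** 2)


def grad_log_target(x):
    return -x


samples, acceptance_rate = mala(log_target, grad_log_target,
                                x0=np.zeros(2), step_size=0.5,
                                n_samples=5000, seed=0)
```

The gradient drift is what distinguishes MALA from a random-walk Metropolis sampler: proposals are pushed toward regions of higher density, which tends to help as the number of parameters grows. In practice the step size would need tuning for the actual posterior.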