Department of Statistics, University of Washington, Seattle, WA 98195, Department of Human Genetics, University of Chicago, Chicago, IL 60637 and Department of Biomathematics, Human Genetics, and Statistics, University of California Los Angeles, Los Angeles, CA 90095, USA.
Bioinformatics. 2014 Oct 15;30(20):2915-22. doi: 10.1093/bioinformatics/btu418. Epub 2014 Jul 9.
Unique modeling and computational challenges arise in locating the geographic origin of individuals based on their genetic backgrounds. Single-nucleotide polymorphisms (SNPs) vary widely in informativeness, allele frequencies change non-linearly with geography and reliable localization requires evidence to be integrated across a multitude of SNPs. These problems become even more acute for individuals of mixed ancestry. It is hardly surprising that matching genetic models to computational constraints has limited the development of methods for estimating geographic origins. We attack these related problems by borrowing ideas from image processing and optimization theory. Our proposed model divides the region of interest into pixels and operates SNP by SNP. We estimate allele frequencies across the landscape by maximizing a product of binomial likelihoods penalized by nearest neighbor interactions. Penalization smooths allele frequency estimates and promotes estimation at pixels with no data. Maximization is accomplished by a minorize-maximize (MM) algorithm. Once allele frequency surfaces are available, one can apply Bayes' rule to compute the posterior probability that each pixel is the pixel of origin of a given person. Placement of admixed individuals on the landscape is more complicated and requires estimation of the fractional contribution of each pixel to a person's genome. This estimation problem also succumbs to a penalized MM algorithm.
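The Bayes' rule step for unmixed individuals can be illustrated with a minimal sketch. It is not the OriGen implementation. It assumes that allele frequency surfaces have already been estimated, that SNPs are independent, that genotypes follow Hardy-Weinberg proportions within a pixel, and that the prior over pixels is uniform unless one is supplied. The function name `pixel_posterior` and its parameters are hypothetical.

```python
import math


def pixel_posterior(genotype, freq, prior=None, eps=1e-9):
    """Posterior probability that each pixel is a person's pixel of origin.

    genotype: allele counts (0, 1, or 2), one entry per SNP.
    freq:     freq[k][l] is the estimated allele frequency at pixel k, SNP l.
    prior:    strictly positive prior weights over pixels (uniform if None).
    eps:      clamp that keeps frequencies away from 0 and 1 before taking logs.

    Assumptions (not taken from the paper): independent SNPs and
    Hardy-Weinberg genotype probabilities within each pixel.
    """
    n_pixels = len(freq)
    if prior is None:
        prior = [1.0 / n_pixels] * n_pixels

    log_post = []
    for k in range(n_pixels):
        total = math.log(prior[k])
        for g, p in zip(genotype, freq[k]):
            p = min(max(p, eps), 1.0 - eps)
            # Binomial(2, p) log-probability of observing g copies of the allele.
            total += (math.log(math.comb(2, g))
                      + g * math.log(p)
                      + (2 - g) * math.log(1.0 - p))
        log_post.append(total)

    # Normalize with the log-sum-exp trick so that products over many SNPs
    # do not underflow.
    top = max(log_post)
    weights = [math.exp(v - top) for v in log_post]
    norm = sum(weights)
    return [w / norm for w in weights]


# Toy example: three pixels, three SNPs, made-up frequencies.
freq = [[0.9, 0.8, 0.7],
        [0.5, 0.5, 0.5],
        [0.1, 0.2, 0.3]]
genotype = [2, 2, 1]
post = pixel_posterior(genotype, freq)
```

Working in log space matters here because localization integrates evidence across many SNPs, and a direct product of per-SNP probabilities would quickly underflow.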
We applied the model to the Population Reference Sample (POPRES) data. The model gives better localization for both unmixed and admixed individuals than existing methods despite using just a small fraction of the available SNPs. Computing times are comparable with the best competing software.
Software will be freely available as the OriGen package in R.
Contact: ranolaj@uw.edu or klange@ucla.edu
Supplementary data are available at Bioinformatics online.