The Blavatnik School of Computer Science, Tel-Aviv University, Tel-Aviv, 69978, Israel.
Am J Hum Genet. 2013 Jun 6;92(6):882-94. doi: 10.1016/j.ajhg.2013.04.023. Epub 2013 May 30.
Characterizing the spatial patterns of genetic diversity in human populations has a wide range of applications, from detecting genetic mutations associated with disease to inferring human history. Current approaches, including the widely used principal-component analysis, are not suited for the analysis of linked markers, and local and long-range linkage disequilibrium (LD) can dramatically reduce the accuracy of spatial localization when unaccounted for. To overcome this, we have introduced an approach that performs spatial localization of individuals on the basis of their genetic data and explicitly models LD among markers by using a multivariate normal distribution. By leveraging external reference panels, we derive closed-form solutions to the optimization procedure to achieve a computationally efficient method that can handle large data sets. We validate the method on empirical data from a large sample of European individuals from the POPRES data set, as well as on a large sample of individuals of Spanish ancestry. First, we show that by modeling LD, we achieve accuracy superior to that of existing methods. Importantly, whereas other methods show decreased performance when dense marker panels are used in the inference, our approach improves in accuracy as more markers become available. Second, we show that accurate localization of genetic data can be achieved with only a part of the genome, and this could potentially enable the spatial localization of admixed samples that have a fraction of their genome originating from a given continent. Finally, we demonstrate that our approach is resistant to distortions resulting from long-range LD regions; such distortions can dramatically bias the results when unaccounted for.
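As a rough illustration of the central idea, scoring candidate locations with a multivariate normal likelihood whose covariance carries the LD between markers, the sketch below may help. It is not the authors' implementation. The abstract states that the method derives closed-form solutions, whereas this toy swaps in a brute-force grid search. The logistic allele-frequency gradient, the location-independent covariance taken from a reference panel, and every function and parameter name (`mvn_loglik`, `expected_genotype`, `localize`, `slopes`, `intercepts`, `ref_cov`) are assumptions introduced here for illustration only.

```python
import numpy as np


def mvn_loglik(g, mu, cov):
    """Log-density of genotype vector g under N(mu, cov)."""
    m = g.shape[0]
    resid = g - mu
    chol = np.linalg.cholesky(cov)        # requires a positive-definite covariance
    z = np.linalg.solve(chol, resid)      # whitened residual, so z @ z is the Mahalanobis term
    logdet = 2.0 * np.sum(np.log(np.diag(chol)))
    return -0.5 * (m * np.log(2.0 * np.pi) + logdet + z @ z)


def expected_genotype(x, slopes, intercepts):
    """Assumed model: allele frequency follows a logistic gradient in location x.

    Returns the expected allele count per marker, 2 * sigmoid(slopes @ x + intercepts).
    """
    return 2.0 / (1.0 + np.exp(-(slopes @ x + intercepts)))


def localize(g, slopes, intercepts, ref_cov, grid):
    """Return the grid location that maximizes the LD-aware likelihood of g.

    ref_cov is a marker-by-marker covariance estimated from an external
    reference panel; its off-diagonal entries encode LD. Treating it as
    independent of location is a simplification made for this sketch.
    """
    scores = [
        mvn_loglik(g, expected_genotype(x, slopes, intercepts), ref_cov)
        for x in grid
    ]
    return grid[int(np.argmax(scores))]


# Toy data (entirely made up): three linked markers, a 5 x 5 grid of locations.
slopes = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
intercepts = np.array([0.0, -0.5, 0.5])
ref_cov = np.array([[0.5, 0.2, 0.0],
                    [0.2, 0.5, 0.1],
                    [0.0, 0.1, 0.5]])
axis = np.linspace(-1.0, 1.0, 5)
grid = np.array([[x, y] for x in axis for y in axis])

if __name__ == "__main__":
    genotype = np.array([2.0, 1.0, 1.0])  # allele counts at the three markers
    print(localize(genotype, slopes, intercepts, ref_cov, grid))
```

Setting the off-diagonal entries of `ref_cov` to zero reduces the score to a model that treats markers as independent, the assumption the abstract identifies as the reason accuracy drops on dense marker panels.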