Fast spatial ancestry via flexible allele frequency surfaces.

Affiliations

Department of Statistics, University of Washington, Seattle, WA 98195; Department of Human Genetics, University of Chicago, Chicago, IL 60637; and Department of Biomathematics, Human Genetics, and Statistics, University of California Los Angeles, Los Angeles, CA 90095, USA.

Publication information

Bioinformatics. 2014 Oct 15;30(20):2915-22. doi: 10.1093/bioinformatics/btu418. Epub 2014 Jul 9.

DOI:10.1093/bioinformatics/btu418

PMID:25012181

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC4184261/

Abstract

MOTIVATION

Unique modeling and computational challenges arise in locating the geographic origin of individuals based on their genetic backgrounds. Single-nucleotide polymorphisms (SNPs) vary widely in informativeness, allele frequencies change non-linearly with geography and reliable localization requires evidence to be integrated across a multitude of SNPs. These problems become even more acute for individuals of mixed ancestry. It is hardly surprising that matching genetic models to computational constraints has limited the development of methods for estimating geographic origins. We attack these related problems by borrowing ideas from image processing and optimization theory. Our proposed model divides the region of interest into pixels and operates SNP by SNP. We estimate allele frequencies across the landscape by maximizing a product of binomial likelihoods penalized by nearest neighbor interactions. Penalization smooths allele frequency estimates and promotes estimation at pixels with no data. Maximization is accomplished by a minorize-maximize (MM) algorithm. Once allele frequency surfaces are available, one can apply Bayes' rule to compute the posterior probability that each pixel is the pixel of origin of a given person. Placement of admixed individuals on the landscape is more complicated and requires estimation of the fractional contribution of each pixel to a person's genome. This estimation problem also succumbs to a penalized MM algorithm.
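The Bayes' rule step described above can be illustrated with a minimal sketch. It is not the OriGen implementation. It assumes that allele frequency surfaces have already been estimated, that each genotype is a binomial draw of two alleles at the pixel's frequency, that SNPs contribute independently, and that the prior over pixels is uniform unless one is supplied. The function name `pixel_posterior` and the toy frequencies are hypothetical.

```python
import math

def pixel_posterior(freqs, genotype, prior=None):
    """Posterior probability that each pixel is the pixel of origin.

    freqs[k][j] : estimated allele frequency at pixel k for SNP j (hypothetical input)
    genotype[j] : count (0, 1, or 2) of the tracked allele at SNP j
    prior[k]    : prior probability of pixel k; uniform if omitted (an assumption)
    """
    n_pixels = len(freqs)
    if prior is None:
        prior = [1.0 / n_pixels] * n_pixels
    eps = 1e-12  # keep frequencies off 0 and 1 so the logs stay finite
    log_post = []
    for k in range(n_pixels):
        total = math.log(prior[k])
        for p, g in zip(freqs[k], genotype):
            p = min(max(p, eps), 1.0 - eps)
            # Binomial(2, p) log-likelihood of the genotype at this SNP
            total += (math.log(math.comb(2, g))
                      + g * math.log(p)
                      + (2 - g) * math.log(1.0 - p))
        log_post.append(total)
    # Normalize with the log-sum-exp trick to avoid underflow over many SNPs
    m = max(log_post)
    weights = [math.exp(v - m) for v in log_post]
    z = sum(weights)
    return [w / z for w in weights]

# Toy example: three pixels, two SNPs, an individual homozygous at both SNPs
freqs = [[0.9, 0.8], [0.5, 0.5], [0.1, 0.2]]
posterior = pixel_posterior(freqs, genotype=[2, 2])
```

Working in log space matters here because the likelihood is a product over many SNPs and would otherwise underflow. In the toy example the first pixel, where the tracked allele is most common, receives most of the posterior mass.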

RESULTS

We applied the model to the Population Reference Sample (POPRES) data. The model gives better localization for both unmixed and admixed individuals than existing methods despite using just a small fraction of the available SNPs. Computing times are comparable with the best competing software.

AVAILABILITY AND IMPLEMENTATION

Software will be freely available as the OriGen package in R.

CONTACT

ranolaj@uw.edu or klange@ucla.edu

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.
