Scaling the Discrete-time Wright Fisher model to biobank-scale datasets.

Author information

Spence Jeffrey P, Zeng Tony, Mostafavi Hakhamanesh, Pritchard Jonathan K

Affiliations

Department of Genetics, Stanford University.

Department of Biology, Stanford University.

Publication information

bioRxiv. 2023 May 22:2023.05.19.541517. doi: 10.1101/2023.05.19.541517.

Abstract

The Discrete-Time Wright Fisher (DTWF) model and its large population diffusion limit are central to population genetics. These models describe the forward-in-time evolution of the frequency of an allele in a population and can include the fundamental forces of genetic drift, mutation, and selection. Computing likelihoods under the diffusion process is feasible, but the diffusion approximation breaks down for large sample sizes or in the presence of strong selection. Unfortunately, existing methods for computing likelihoods under the DTWF model do not scale to current exome sequencing sample sizes in the hundreds of thousands. Here we present an algorithm that approximates the DTWF model with provably bounded error and runs in time linear in the size of the population. Our approach relies on two key observations about Binomial distributions. The first is that Binomial distributions are approximately sparse. The second is that Binomial distributions with similar success probabilities are extremely close as distributions, allowing us to approximate the DTWF Markov transition matrix as a very low rank matrix. Together, these observations enable matrix-vector multiplication in linear (as opposed to the usual quadratic) time. We prove similar properties for Hypergeometric distributions, enabling fast computation of likelihoods for subsamples of the population. We show theoretically and in practice that this approximation is highly accurate and can scale to population sizes in the billions, paving the way for rigorous biobank-scale population genetic inference. Finally, we use our results to estimate how increasing sample sizes will improve the estimation of selection coefficients acting on loss-of-function variants. We find that increasing sample sizes beyond existing large exome sequencing cohorts will provide essentially no additional information except for genes with the most extreme fitness effects.
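The first observation, that Binomial distributions are approximately sparse, can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a neutral model (no mutation or selection) in a haploid population of `n` individuals, so that row `i` of the transition matrix is Binomial(`n`, `i/n`). The names `binom_pmf`, `window`, and `dtwf_step` are hypothetical, and the window rule (mean plus or minus eight standard deviations, plus a constant pad) is a heuristic chosen for illustration rather than the provable bound described in the abstract.

```python
import math


def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p), evaluated in log space for stability."""
    if p <= 0.0:
        return 1.0 if k == 0 else 0.0
    if p >= 1.0:
        return 1.0 if k == n else 0.0
    log_pmf = (
        math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        + k * math.log(p) + (n - k) * math.log1p(-p)
    )
    return math.exp(log_pmf)


def window(n, p, width=8.0):
    """Heuristic index range holding nearly all Binomial(n, p) mass (assumed rule)."""
    mean = n * p
    pad = width * math.sqrt(n * p * (1.0 - p)) + width
    return max(0, math.floor(mean - pad)), min(n, math.ceil(mean + pad))


def dtwf_step(v, n, sparse=True):
    """One neutral generation: out[k] = sum_i v[i] * Binomial(n, i/n).pmf(k)."""
    out = [0.0] * (n + 1)
    for i, mass in enumerate(v):
        if mass == 0.0:
            continue  # rows with no probability mass contribute nothing
        p = i / n
        # Sparse mode only visits entries inside the heuristic window.
        lo, hi = window(n, p) if sparse else (0, n)
        for k in range(lo, hi + 1):
            out[k] += mass * binom_pmf(k, n, p)
    return out


if __name__ == "__main__":
    n = 200
    v0 = [1.0 / (n + 1)] * (n + 1)  # uniform starting distribution over counts
    dense = dtwf_step(v0, n, sparse=False)
    fast = dtwf_step(v0, n, sparse=True)
    print(max(abs(a - b) for a, b in zip(dense, fast)))
```

Because each window has width on the order of the square root of `n`, this sketch does roughly `n` times the square root of `n` work per generation instead of `n` squared. Reaching the linear running time stated in the abstract additionally relies on the second observation (nearby rows are nearly identical, so the matrix is approximately low rank), which this sketch does not implement.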


