同源片段的快速模拟。

Fast simulation of identity-by-descent segments.

作者信息

Temple Seth D, Browning Sharon R, Thompson Elizabeth A

机构信息

Department of Statistics, University of Washington, Seattle, WA, USA.

Department of Statistics, University of Michigan, Ann Arbor, MI, USA.

出版信息

Bull Math Biol. 2025 May 23;87(7):84. doi: 10.1007/s11538-025-01464-8.

DOI:10.1007/s11538-025-01464-8

PMID:40410602

原文链接:https://pmc.ncbi.nlm.nih.gov/articles/PMC12102126/

Abstract

The worst-case runtime complexity to simulate haplotype segments identical by descent (IBD) is quadratic in sample size. We propose two main techniques to reduce the compute time, both of which are motivated by coalescent and recombination processes. We provide mathematical results that explain why our algorithm should outperform a naive implementation with high probability. In our experiments, we observe average compute times to simulate detectable IBD segments around a locus that scale approximately linearly in sample size and take a couple of seconds for sample sizes that are less than 10,000 diploid individuals. In contrast, we find that existing methods to simulate IBD segments take minutes to hours for sample sizes exceeding a few thousand diploid individuals. When using IBD segments to study recent positive selection around a locus, our efficient simulation algorithm makes feasible statistical inferences, e.g., parametric bootstrapping in analyses of large biobanks, that would be otherwise intractable.

摘要

模拟同源相同的单倍型片段（IBD）的最坏情况运行时复杂度与样本量呈二次方关系。我们提出了两种主要技术来减少计算时间，这两种技术均受溯祖和重组过程的启发。我们提供了数学结果，解释了为什么我们的算法很可能优于简单的实现方式。在我们的实验中，我们观察到在一个位点周围模拟可检测的IBD片段的平均计算时间与样本量大致呈线性关系，对于样本量小于10000个二倍体个体的情况，计算时间只需几秒钟。相比之下，我们发现现有模拟IBD片段的方法对于超过几千个二倍体个体的样本量需要数分钟到数小时。当使用IBD片段来研究一个位点周围的近期正选择时，我们高效的模拟算法使得可行的统计推断成为可能，例如在大型生物样本库分析中的参数自举法，否则这将是难以处理的。