Celentano Michael, DeWitt William S, Prillo Sebastian, Song Yun S
Department of Statistics, University of California, Berkeley, CA 94720.
Computer Science Division, University of California, Berkeley, CA 94720.
Proc Natl Acad Sci U S A. 2025 May 20;122(20):e2412978122. doi: 10.1073/pnas.2412978122. Epub 2025 May 14.
Many biological studies involve inferring the evolutionary history of a sample of individuals from a large population and interpreting the reconstructed tree. Such an ascertained tree typically represents only a small part of a comprehensive population tree and is distorted by survivorship and sampling biases. Inferring evolutionary parameters from ascertained trees requires modeling both the underlying population dynamics and the ascertainment process. A crucial component of this phylodynamic modeling involves tree simulation, which is used to benchmark probabilistic inference methods. To simulate an ascertained tree, one must first simulate the full population tree and then prune unobserved lineages. Consequently, the computational cost is determined not by the size of the final simulated tree, but by the size of the population tree in which it is embedded. In most biological scenarios, simulations of the entire population are prohibitively expensive due to computational demands placed on lineages without sampled descendants. Here, we address this challenge by proving that, for any partially ascertained process from a general multitype birth-death-mutation-sampling model, there exists an equivalent process with complete sampling and no death, a property which we leverage to develop a highly efficient algorithm for simulating trees. Our algorithm scales linearly with the size of the final simulated tree and is independent of the population size, enabling simulations from extremely large populations beyond the reach of current methods but essential for various biological applications. We anticipate that this massive speedup will significantly advance the development of novel inference methods that require extensive training data.
许多生物学研究涉及从大量种群中推断个体样本的进化历史,并解读重建的树状图。这样一个确定的树状图通常仅代表综合种群树的一小部分,并且会因生存偏差和抽样偏差而扭曲。从确定的树状图推断进化参数需要对潜在的种群动态和确定过程进行建模。这种系统发育动力学建模的一个关键组成部分涉及树状图模拟,它用于对概率推断方法进行基准测试。为了模拟一个确定的树状图,必须首先模拟完整的种群树,然后修剪未观察到的谱系。因此,计算成本不是由最终模拟树的大小决定的,而是由其嵌入的种群树的大小决定的。在大多数生物学场景中,由于对没有抽样后代的谱系的计算需求,对整个种群进行模拟的成本高得令人望而却步。在这里,我们通过证明对于一般多类型出生-死亡-突变-抽样模型的任何部分确定过程,都存在一个具有完全抽样且无死亡的等效过程来应对这一挑战,我们利用这一特性开发了一种高效的树状图模拟算法。我们的算法与最终模拟树的大小呈线性比例,并且与种群大小无关,能够对当前方法无法处理的极大型种群进行模拟,但这对于各种生物学应用至关重要。我们预计这种大幅加速将显著推进需要大量训练数据的新型推断方法的发展。