Department of Microbiology, Immunology and Transplantation, Rega Institute, KU Leuven, Leuven, Belgium.
Department of Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, CA.
Mol Biol Evol. 2020 Jun 1;37(6):1832-1842. doi: 10.1093/molbev/msaa047.
Reconstructing pathogen dynamics from genetic data as they become available during an outbreak or epidemic represents an important statistical scenario in which observations arrive sequentially in time and one is interested in performing inference in an "online" fashion. Widely used Bayesian phylogenetic inference packages are not set up for this purpose, generally requiring one to recompute trees and evolutionary model parameters de novo when new data arrive. To accommodate increasing data flow in a Bayesian phylogenetic framework, we introduce a methodology to efficiently update the posterior distribution with newly available genetic data. Our procedure is implemented in the BEAST 1.10 software package; it relies on a distance-based measure to insert new taxa into the current estimate of the phylogeny and imputes plausible values for new model parameters to accommodate growing dimensionality. This augmentation creates informed starting values and re-uses optimally tuned transition kernels for posterior exploration of growing data sets, reducing the time necessary to converge to target posterior distributions. We apply our framework to data from the recent West African Ebola virus epidemic and demonstrate a considerable reduction in time required to obtain posterior estimates at different time points of the outbreak. Beyond epidemic monitoring, this framework easily finds other applications within the phylogenetics community, where changes in the data (in terms of alignment changes, sequence addition or removal) present common scenarios that can benefit from online inference.
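The distance-based insertion step mentioned in the abstract can be illustrated with a minimal sketch. This is not the BEAST 1.10 implementation: the abstract does not specify the distance measure or the insertion rule, so the sketch assumes a simple p-distance (proportion of differing sites) and attaches each new taxon as a sibling of its closest existing tip. All names here (`Node`, `join`, `p_distance`, `insert_taxon`, `newick`) are hypothetical, and branch lengths and the imputation of new model parameters are left out.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(eq=False)  # identity-based equality, so list.index() finds the exact node
class Node:
    name: Optional[str] = None
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = None

    def is_leaf(self) -> bool:
        return not self.children


def join(*children: Node) -> Node:
    """Create an internal node over the given children and set parent links."""
    parent = Node(children=list(children))
    for child in children:
        child.parent = parent
    return parent


def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites, skipping gaps and ambiguous 'N' sites.

    Assumption: pairs with no comparable sites are treated as maximally distant.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    sites = [(x, y) for x, y in zip(a, b) if x not in "-N" and y not in "-N"]
    if not sites:
        return 1.0
    return sum(x != y for x, y in sites) / len(sites)


def leaves(node: Node) -> List[Node]:
    """Collect the tips below a node in left-to-right order."""
    if node.is_leaf():
        return [node]
    tips: List[Node] = []
    for child in node.children:
        tips.extend(leaves(child))
    return tips


def insert_taxon(root: Node, sequences: Dict[str, str],
                 new_name: str, new_seq: str) -> Node:
    """Attach a new tip next to its closest existing tip; return the root."""
    closest = min(leaves(root),
                  key=lambda tip: p_distance(sequences[tip.name], new_seq))
    new_leaf = Node(name=new_name)
    old_parent = closest.parent
    joint = join(closest, new_leaf)  # new internal node above both tips
    joint.parent = old_parent
    if old_parent is None:
        root = joint  # the tree was a single tip
    else:
        siblings = old_parent.children
        siblings[siblings.index(closest)] = joint
    sequences[new_name] = new_seq
    return root


def newick(node: Node) -> str:
    """Topology-only Newick string (no branch lengths)."""
    if node.is_leaf():
        return node.name
    return "(" + ",".join(newick(child) for child in node.children) + ")"


if __name__ == "__main__":
    seqs = {"A": "ACGTACGT", "B": "ACGTACGA", "C": "TTGTACCA"}
    a, b, c = Node("A"), Node("B"), Node("C")
    tree = join(join(a, b), c)
    tree = insert_taxon(tree, seqs, "D", "TTGTACCT")
    print(newick(tree))  # ((A,B),(C,D))
```

In an online setting, a tree augmented this way would serve only as a starting state for further posterior sampling, not as a final estimate.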