Hassler Gabriel W, Magee Andrew, Zhang Zhenyu, Baele Guy, Lemey Philippe, Ji Xiang, Fourment Mathieu, Suchard Marc A
Department of Computational Medicine, University of California, Los Angeles, USA, 90095.
Department of Biostatistics, University of California, Los Angeles, USA, 90095.
Annu Rev Stat Appl. 2023;10:353-377. doi: 10.1146/annurev-statistics-033021-112532. Epub 2022 Sep 28.
Researchers studying the evolution of viral pathogens and other organisms increasingly encounter and use large, complex data sets from multiple sources. Statistical research in Bayesian phylogenetics has risen to this challenge. Researchers use phylogenetics not only to reconstruct the evolutionary history of a group of organisms, but also to understand the processes that guide its evolution and spread through space and time. To this end, it is now the norm to integrate numerous sources of data. For example, epidemiologists studying the spread of a virus through a region incorporate data including genetic sequences (e.g., DNA), time, location (both continuous and discrete), and environmental covariates (e.g., social connectivity between regions) into a coherent statistical model. Evolutionary biologists routinely do the same with genetic sequences, location, time, fossil and modern phenotypes, and ecological covariates. These complex, hierarchical models readily accommodate both discrete and continuous data and have enormous combined discrete/continuous parameter spaces including, at a minimum, phylogenetic tree topologies and branch lengths. The increased size and complexity of these statistical models have spurred advances in computational methods to make them tractable. We discuss both the modeling and computational advances below, as well as unsolved problems and areas of active research.