Department of Biostatistics, School of Public Health, University of California, Los Angeles, CA, USA.
Mol Biol Evol. 2012 May;29(5):1367-77. doi: 10.1093/molbev/msr305. Epub 2011 Dec 8.
Unprecedented global surveillance of viruses will result in massive sequence data sets that require new statistical methods. These data sets press the limits of Bayesian phylogenetics as the high-dimensional parameters that comprise a phylogenetic tree increase the already sizable computational burden of these techniques. This burden often results in partitioning the data set, for example, by gene, and inferring the evolutionary dynamics of each partition independently, a compromise that results in stratified analyses that depend only on data within a given partition. However, parameter estimates inferred from these stratified models are likely strongly correlated, considering they rely on data from a single data set. To overcome this shortfall, we exploit the existing Monte Carlo realizations from stratified Bayesian analyses to efficiently estimate a nonparametric hierarchical wavelet-based model and learn about the time-varying parameters of effective population size that reflect levels of genetic diversity across all partitions simultaneously. Our methods are applied to complete genome influenza A sequences that span 13 years. We find that broad peaks and trends, as opposed to seasonal spikes, in the effective population size history distinguish individual segments from the complete genome. We also address hypotheses regarding intersegment dynamics within a formal statistical framework that accounts for correlation between segment-specific parameters.
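As a rough illustration only, the idea of reusing per-partition Monte Carlo output and separating broad trends from fine-scale (for example, seasonal) variation can be sketched with a plain Haar wavelet transform. This is not the hierarchical model described in the abstract. It assumes that each segment's stratified analysis has already produced posterior draws of log effective population size on a regular grid of 2^J time points. The array names, the choice of the Haar wavelet, the number of levels dropped, and the synthetic data are all assumptions made for this sketch.

```python
import numpy as np


def haar_dwt(x):
    """Full orthonormal Haar decomposition along the last axis.

    Returns (approx, details); details[0] holds the finest-scale coefficients.
    """
    approx = np.asarray(x, dtype=float)
    n = approx.shape[-1]
    if n < 1 or n & (n - 1):
        raise ValueError("grid length must be a power of two")
    details = []
    while approx.shape[-1] > 1:
        even, odd = approx[..., 0::2], approx[..., 1::2]
        details.append((even - odd) / np.sqrt(2.0))
        approx = (even + odd) / np.sqrt(2.0)
    return approx, details


def haar_idwt(approx, details):
    """Invert haar_dwt, rebuilding the signal from coarse to fine."""
    for d in reversed(details):
        out = np.empty(approx.shape[:-1] + (2 * approx.shape[-1],))
        out[..., 0::2] = (approx + d) / np.sqrt(2.0)
        out[..., 1::2] = (approx - d) / np.sqrt(2.0)
        approx = out
    return approx


def broad_trend(x, drop_levels):
    """Zero the `drop_levels` finest detail levels and reconstruct."""
    approx, details = haar_dwt(x)
    kept = [np.zeros_like(d) if i < drop_levels else d
            for i, d in enumerate(details)]
    return haar_idwt(approx, kept)


# Synthetic stand-in for stratified posterior output (hypothetical data):
# shape (segments, Monte Carlo draws, time grid points).
rng = np.random.default_rng(0)
n_segments, n_draws, n_grid = 3, 200, 64
t = np.linspace(0.0, 13.0, n_grid)            # a 13-year span
shared = 0.5 * np.sin(2 * np.pi * t / 13.0)   # broad trend common to segments
seasonal = 0.3 * np.sin(2 * np.pi * t)        # yearly oscillation
draws = (shared + seasonal
         + rng.normal(0.0, 0.2, size=(n_segments, 1, 1))      # segment offsets
         + rng.normal(0.0, 0.1, size=(n_segments, n_draws, n_grid)))

# Smooth every draw, then summarize each segment by its posterior mean trend.
trend = broad_trend(draws, drop_levels=2)
mean_trend = trend.mean(axis=1)

# Correlation over time between the segments' smoothed trajectories.
segment_corr = np.corrcoef(mean_trend)
```

Because the stratified analyses are run independently, their draws are not jointly sampled, so this sketch compares segments only through their posterior mean trends. A hierarchical model of the kind the abstract describes would instead estimate the correlation between segment-specific parameters directly.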