Division of EcoScience, Ewha Womans University, Seoul 03760, Korea.
Department of Life Science, Ewha Womans University, Seoul 03760, Korea.
Genetics. 2021 Feb 9;217(2). doi: 10.1093/genetics/iyaa039.
Basic summary statistics that quantify the population genetic structure of influenza virus are important for understanding and inferring its evolutionary and epidemiological processes. However, the sampling dates of global virus sequences from the last several decades are scattered nonuniformly throughout the calendar. This temporal structure of samples and the small effective size of the viral population hamper the use of conventional methods to calculate summary statistics. Here, we define statistics that overcome this problem by correcting for the sampling-time difference when quantifying a pairwise sequence difference. A simple linear regression method jointly estimates the mutation rate and the level of sequence polymorphism, thus providing an estimate of the effective population size. It also leads to a definition of Wright's FST for arbitrary time-series data. Furthermore, as an alternative to Tajima's D statistic or the site-frequency spectrum, a mismatch distribution corrected for sampling-time differences can be obtained and compared between actual and simulated data. Applying these methods to seasonal influenza A/H3N2 viruses sampled between 1980 and 2017, and to sequences simulated under a model of recurrent positive selection with metapopulation dynamics, allowed us to estimate the synonymous mutation rate and to find parameter values for selection and demographic structure that fit the observations. We found that the mutation rates of the HA and PB1 segments before 2007 were particularly high and that including recurrent positive selection in our model was essential for reproducing the genealogical structure of the HA segment. The methods developed here can be generally applied to population genetic inferences using serially sampled genetic data.
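The regression idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a haploid neutral model in which the expected per-site difference between two sequences sampled a time Δt apart is μ·Δt + 2·Ne·μ, so that the slope of an ordinary least-squares line estimates the mutation rate μ and the intercept estimates the polymorphism level among contemporaneous sequences. The function names `pairwise_table` and `fit_rate_and_diversity` are hypothetical, and the sequences are assumed to be aligned and of equal length.

```python
from itertools import combinations

import numpy as np


def pairwise_table(seqs, times):
    """For every pair of aligned sequences, return the absolute
    sampling-time gap and the per-site proportion of differences."""
    dt, d = [], []
    for i, j in combinations(range(len(seqs)), 2):
        a, b = seqs[i], seqs[j]
        n_diff = sum(x != y for x, y in zip(a, b))
        dt.append(abs(times[i] - times[j]))
        d.append(n_diff / len(a))
    return np.array(dt, dtype=float), np.array(d, dtype=float)


def fit_rate_and_diversity(dt, d):
    """Least-squares fit of d = mu * dt + pi0.

    Assumed model (not taken from the source): pi0 = 2 * Ne * mu,
    so Ne = pi0 / (2 * mu), in the same time units as dt.
    """
    mu, pi0 = np.polyfit(dt, d, 1)  # returns [slope, intercept]
    ne = pi0 / (2.0 * mu)
    return mu, pi0, ne


# Toy usage with made-up sequences and sampling times.
seqs = ["AAAA", "AAAT", "AATT"]
times = [0.0, 1.0, 3.0]
dt, d = pairwise_table(seqs, times)
mu_hat, pi0_hat, ne_hat = fit_rate_and_diversity(dt, d)
```

Because every sequence appears in many pairs, the points in this regression are not independent, so the standard errors from an ordinary least-squares fit should not be taken at face value; resampling sequences rather than pairs is one possible workaround.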