Saulnier Emma, Gascuel Olivier, Alizon Samuel
Laboratoire Maladies Infectieuses et Vecteurs: Ecologie, Génétique, Evolution et Contrôle - UMR CNRS 5290, IRD 224 et UM, Montpellier, France.
Institut de Biologie Computationnelle (IBC) and Laboratoire d'Informatique, de Robotique et de Microélectronique de Montpellier (LIRMM) - UMR 5506, CNRS et UM, Montpellier, France.
PLoS Comput Biol. 2017 Mar 6;13(3):e1005416. doi: 10.1371/journal.pcbi.1005416. eCollection 2017 Mar.
Inferring epidemiological parameters such as the basic reproduction number (R0) from time-scaled phylogenies is a timely challenge. Most current approaches rely on likelihood functions, which raise specific issues ranging from computing these functions to finding their maxima numerically. Here, we present a new regression-based Approximate Bayesian Computation (ABC) approach, which we base on a large variety of summary statistics intended to capture the information contained in the phylogeny and its corresponding lineage-through-time plot. The regression step uses the Least Absolute Shrinkage and Selection Operator (LASSO) method, a robust machine learning technique. It allows us to readily handle the large number of summary statistics without resorting to Markov Chain Monte Carlo (MCMC) techniques. To compare our approach to existing ones, we simulated target trees under a variety of epidemiological models and settings, and inferred the parameters of interest using the same priors. We found that, for large phylogenies, the accuracy of our regression-ABC is comparable to that of likelihood-based approaches involving birth-death processes implemented in BEAST2. Our approach even outperformed these likelihood-based approaches when inferring the host population size with a Susceptible-Infected-Removed epidemiological model. It also clearly outperformed a recent kernel-ABC approach when assuming a Susceptible-Infected epidemiological model with two host types. Lastly, by re-analyzing data from the early stages of the recent Ebola epidemic in Sierra Leone, we showed that regression-ABC provides more realistic estimates for the duration parameters (latency and infectiousness) than the likelihood-based method. Overall, ABC based on a large variety of summary statistics, combined with a regression method that performs variable selection and avoids overfitting, is a promising approach for analyzing large phylogenies.
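The regression-ABC idea described in the abstract (simulate under the prior, summarize, keep the simulations closest to the target, then correct the retained parameter values with a LASSO regression on the summary statistics) can be illustrated with a minimal sketch. This is not the authors' implementation. Everything here is an assumption made for illustration: the toy simulator draws exponential waiting times instead of phylogenies, the five summary statistics (one of them pure noise) stand in for the paper's much larger set, and the function names, uniform prior, tolerance, and penalty value are hypothetical. The LASSO is fitted with a plain coordinate-descent loop so that the example needs only the Python standard library.

```python
import random
import statistics


def simulate_summaries(rate, n, rng):
    """Toy stand-in for 'simulate a tree, then summarize it' (assumption)."""
    x = [rng.expovariate(rate) for _ in range(n)]
    # The last entry is pure noise, so the LASSO has something to shrink away.
    return [statistics.mean(x), statistics.median(x), max(x),
            statistics.pstdev(x), rng.random()]


def soft_threshold(z, lam):
    """Soft-thresholding operator used by the LASSO coordinate update."""
    return z - lam if z > lam else z + lam if z < -lam else 0.0


def lasso_fit(X, y, lam, n_iter=200):
    """Coordinate-descent LASSO on standardized columns.

    Minimizes (1/2n)*||y - b0 - Z b||^2 + lam*||b||_1 and returns the
    coefficients together with a prediction function on raw statistics.
    """
    n, p = len(X), len(X[0])
    means = [sum(r[j] for r in X) / n for j in range(p)]
    sds = [(sum((r[j] - means[j]) ** 2 for r in X) / n) ** 0.5 or 1.0
           for j in range(p)]
    Z = [[(r[j] - means[j]) / sds[j] for j in range(p)] for r in X]
    intercept = sum(y) / n
    resid = [yi - intercept for yi in y]
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # Correlation of column j with the residual that excludes column j.
            rho = sum(Z[i][j] * (resid[i] + beta[j] * Z[i][j])
                      for i in range(n)) / n
            new = soft_threshold(rho, lam)
            delta = new - beta[j]
            if delta:
                for i in range(n):
                    resid[i] -= delta * Z[i][j]
                beta[j] = new

    def predict(s):
        return intercept + sum(b * (v - m) / sd
                               for b, v, m, sd in zip(beta, s, means, sds))

    return beta, predict


def regression_abc(s_obs, n_sim=2000, n_obs=100, n_keep=400, lam=0.01, seed=1):
    """Rejection step followed by a LASSO regression adjustment."""
    rng = random.Random(seed)
    thetas = [rng.uniform(0.5, 3.0) for _ in range(n_sim)]  # hypothetical prior
    sims = [simulate_summaries(t, n_obs, rng) for t in thetas]
    # Scale each statistic by its prior-predictive standard deviation.
    scale = [statistics.pstdev([s[j] for s in sims]) or 1.0
             for j in range(len(s_obs))]

    def dist(s):
        return sum(((a - b) / c) ** 2 for a, b, c in zip(s, s_obs, scale))

    keep = sorted(range(n_sim), key=lambda i: dist(sims[i]))[:n_keep]
    X = [sims[i] for i in keep]
    y = [thetas[i] for i in keep]
    beta, predict = lasso_fit(X, y, lam)
    # Shift each retained parameter to what it would be at the target statistics.
    target = predict(s_obs)
    posterior = [yi - predict(xi) + target for xi, yi in zip(X, y)]
    return posterior, beta


if __name__ == "__main__":
    s_obs = simulate_summaries(1.5, 100, random.Random(42))  # "target" data
    posterior, beta = regression_abc(s_obs)
    print("posterior mean:", statistics.mean(posterior))
    print("LASSO coefficients:", beta)
```

In this sketch the adjusted values play the role of an approximate posterior sample for the single toy parameter. Because the correction is linear, adjusted values can fall outside the prior range; a real analysis would need to handle that, for example by regressing on a transformed parameter.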