Layan Maylis, Müller Nicola F, Dellicour Simon, De Maio Nicola, Bourhy Hervé, Cauchemez Simon, Baele Guy
Mathematical Modelling of Infectious Diseases Unit, Institut Pasteur, Université Paris Cité, UMR2000, CNRS, 25-28 rue du Docteur Roux, Paris 75014, France.
Collège Doctoral, Sorbonne Université, 21, rue de l'école de médecine, Paris 75006, France.
Virus Evol. 2023 Feb 6;9(1):vead010. doi: 10.1093/ve/vead010. eCollection 2023.
Bayesian phylogeographic inference is a powerful tool in molecular epidemiological studies, which enables reconstruction of the origin and subsequent geographic spread of pathogens. Such inference is, however, potentially affected by geographic sampling bias. Here, we investigated the impact of sampling bias on the spatiotemporal reconstruction of viral epidemics using Bayesian discrete phylogeographic models and explored different operational strategies to mitigate this impact. We considered the continuous-time Markov chain (CTMC) model and two structured coalescent approximations (Bayesian structured coalescent approximation [BASTA] and marginal approximation of the structured coalescent [MASCOT]). For each approach, we compared the estimated and simulated spatiotemporal histories in biased and unbiased conditions based on the simulated epidemics of rabies virus (RABV) in dogs in Morocco. While the reconstructed spatiotemporal histories were impacted by sampling bias for the three approaches, BASTA and MASCOT reconstructions were also biased when employing unbiased samples. Increasing the number of analyzed genomes led to more robust estimates at low sampling bias for the CTMC model. Alternative sampling strategies that maximize the spatiotemporal coverage greatly improved the inference at intermediate sampling bias for the CTMC model, and to a lesser extent, for BASTA and MASCOT. In contrast, allowing for time-varying population sizes in MASCOT resulted in robust inference. We further applied these approaches to two empirical datasets: a RABV dataset from the Philippines and a SARS-CoV-2 dataset describing its early spread across the world. In conclusion, sampling biases are ubiquitous in phylogeographic analyses but may be accommodated by increasing the sample size, balancing spatial and temporal composition in the samples, and informing structured coalescent models with reliable case count data.
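The abstract does not say how the alternative sampling strategies that maximize spatiotemporal coverage were implemented. As a minimal sketch of one possible reading, the Python snippet below spreads a fixed number of genomes as evenly as possible over (location, time-bin) strata. The function name `balanced_subsample`, the record fields `location` and `year`, and the `bin_years` parameter are hypothetical and chosen only for illustration; they are not taken from the study.

```python
import random
from collections import defaultdict


def balanced_subsample(records, n, bin_years=1.0, seed=0):
    """Pick up to n records, spread as evenly as possible over
    (location, time-bin) strata.

    Assumption: each record is a dict with a 'location' label and a
    numeric 'year' (decimal years are allowed).
    """
    rng = random.Random(seed)

    # Group records into strata keyed by location and time bin.
    strata = defaultdict(list)
    for rec in records:
        key = (rec["location"], int(rec["year"] // bin_years))
        strata[key].append(rec)

    # Shuffle within each stratum so that the picks are random.
    for members in strata.values():
        rng.shuffle(members)

    chosen = []
    keys = sorted(strata)
    # Round-robin: take one record per non-empty stratum per pass,
    # until n records are chosen or every stratum is exhausted.
    while len(chosen) < n and keys:
        rng.shuffle(keys)  # avoid always favouring the same strata
        remaining = []
        for key in keys:
            if len(chosen) == n:
                break
            chosen.append(strata[key].pop())
            if strata[key]:
                remaining.append(key)
        keys = remaining
    return chosen


# Toy data: location "A" is heavily over-sampled relative to "B".
records = (
    [{"location": "A", "year": 2000 + i % 5} for i in range(90)]
    + [{"location": "B", "year": 2000 + i % 5} for i in range(10)]
)
sample = balanced_subsample(records, n=20)
```

In this toy input, 90% of the available genomes come from location "A", yet the round-robin draw gives every location-year stratum the same chance of contributing before any stratum is sampled twice. Whether this kind of balancing is appropriate for a given analysis depends on the phylogeographic model used, as the abstract's contrasting results for CTMC, BASTA, and MASCOT suggest.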