Department of Mathematics, Simon Fraser University, Burnaby, Canada.
Infect Genet Evol. 2023 Sep;113:105484. doi: 10.1016/j.meegid.2023.105484. Epub 2023 Jul 31.
Clustering pathogen sequence data is common practice in epidemiology for gaining insight into the genetic diversity and evolutionary relationships among pathogens. Clustering can identify groups of cases with a shared transmission history and common origin, as well as transmission hotspots. Motivated by the experience of clustering SARS-CoV-2 cases using whole genome sequence data during the COVID-19 pandemic to aid public health investigation, we investigated how differences in epidemiology and sampling can influence the composition of the clusters that are identified.
We performed genomic clustering on simulated SARS-CoV-2 outbreaks produced with different transmission rates and levels of genomic diversity, and varied the proportion of cases sampled.
In single outbreaks with a low transmission rate, decreasing the sampling fraction caused multiple separate clusters to be identified where intermediate cases in transmission chains were missed. Outbreaks simulated with a high transmission rate were more robust to changes in the sampling fraction and largely resulted in a single cluster that included all sampled outbreak cases. When considering multiple outbreaks in a sampled jurisdiction seeded by different introductions, low genomic diversity between introduced cases caused outbreaks to be merged into large clusters. If the transmission rate, the sampling fraction, and the diversity between introductions were all low, a combination of the spurious break-up of outbreaks and the linking of closely related cases in different outbreaks resulted in clusters that may appear informative but did not reflect the true underlying population structure. Conversely, genomic clusters matched the true population structure when there was relatively high diversity between introductions and a high transmission rate.
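The break-up of a single outbreak under sparse sampling can be illustrated with a minimal sketch. The abstract does not state which clustering method was used, so the example below assumes a simple single-linkage rule that links any two cases whose sequences differ at no more than `max_dist` sites; the function names, the toy sequences, and the threshold of 1 are all hypothetical and chosen only for illustration.

```python
from itertools import combinations


def hamming(a, b):
    """Number of sites at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def threshold_clusters(seqs, max_dist):
    """Single-linkage clustering (assumed rule, not the study's method):
    two cases share a cluster if a path of pairwise links, each within
    max_dist differences, connects them."""
    parent = {name: name for name in seqs}

    def find(x):
        # Follow parent pointers to the cluster root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(seqs, 2):
        if hamming(seqs[a], seqs[b]) <= max_dist:
            parent[find(a)] = find(b)

    groups = {}
    for name in seqs:
        groups.setdefault(find(name), set()).add(name)
    return sorted(sorted(g) for g in groups.values())


# Toy transmission chain: each case carries one more mutation than the
# case that infected it (hypothetical data).
chain = {
    "case1": "AAAAAA",
    "case2": "TAAAAA",
    "case3": "TTAAAA",
    "case4": "TTTAAA",
    "case5": "TTTTAA",
}

# Full sampling: every consecutive pair is one mutation apart.
full = threshold_clusters(chain, max_dist=1)

# Reduced sampling: the intermediate case3 is not sequenced.
sampled = {k: v for k, v in chain.items() if k != "case3"}
partial = threshold_clusters(sampled, max_dist=1)

print(full)     # one cluster containing all five cases
print(partial)  # two clusters: case1-case2 and case4-case5
```

With all five cases sampled, the chain forms one cluster. Once `case3` is missing, `case2` and `case4` are two mutations apart, which exceeds the threshold, so the same outbreak is reported as two separate clusters.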
Differences in epidemiology and sampling can impact our ability to identify genomic clusters that describe the underlying population structure. These findings can help to guide recommendations for the use of pathogen clustering in public health investigations.