Liu Molly, Chato Connor, Poon Art F Y
Department of Pathology and Laboratory Medicine, Western University, Dental Sciences Building, Rm. 4044, London, ON N6A 5C1, Canada.
Department of Microbiology and Immunology, Western University, 1151 Richmond Street, London, ON N6A 3K7, Canada.
Virus Evol. 2023 Apr 25;9(1):vead026. doi: 10.1093/ve/vead026. eCollection 2023.
Defining clusters of epidemiologically related infections is a common problem in the surveillance of infectious disease. A popular method for generating clusters is pairwise distance clustering, which assigns pairs of sequences to the same cluster if their genetic distance falls below some threshold. The result is often represented as a network or graph, in which nodes represent sequences and edges join pairs of sequences whose distance falls below the threshold. A connected component is a set of interconnected nodes in a graph that are not connected to any node outside the set. The prevailing approach to pairwise clustering is to map clusters to the connected components of the graph on a one-to-one basis. We propose that this definition of clusters is unnecessarily rigid. For instance, two connected components can collapse into one cluster by the addition of a single sequence that bridges nodes in the respective components. Moreover, the distance thresholds typically used for viruses like HIV-1 tend to exclude a large proportion of new sequences, making it difficult to train models for predicting cluster growth. These issues may be resolved by revisiting how we define clusters from genetic distances. Community detection is a promising class of clustering methods from the field of network science. A community is a set of nodes that are more densely interconnected relative to the number of their connections to external nodes. Thus, a connected component may be partitioned into two or more communities. Here we describe community detection methods in the context of genetic clustering for epidemiology, demonstrate how a popular method (Markov clustering) enables us to resolve variation in transmission rates within a giant connected component of HIV-1 sequences, and identify current challenges and directions for further work.
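The bridging problem described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the sequences and the threshold of 0.25 are hypothetical toy values, and the simple proportion of differing sites (`p_distance`) stands in for whatever genetic distance would be used in practice. All function names are introduced here for illustration only.

```python
from itertools import combinations


def p_distance(a, b):
    """Proportion of sites that differ between two aligned, equal-length sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def threshold_graph(seqs, threshold):
    """Build adjacency sets: an edge joins two sequences whose distance is below the threshold."""
    adj = {name: set() for name in seqs}
    for u, v in combinations(seqs, 2):
        if p_distance(seqs[u], seqs[v]) < threshold:
            adj[u].add(v)
            adj[v].add(u)
    return adj


def connected_components(adj):
    """Return the connected components of the graph as a list of node sets."""
    seen = set()
    components = []
    for start in adj:
        if start in seen:
            continue
        component = set()
        stack = [start]
        while stack:  # depth-first traversal from the start node
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adj[node] - component)
        seen |= component
        components.append(component)
    return components


# Hypothetical toy alignment: two tight pairs, (A, B) and (C, D).
seqs = {
    "A": "AAAAAAAAAA",
    "B": "AAAAAAAAAT",
    "C": "CCCCAAAAAA",
    "D": "CCCCAAAAAT",
}
before = connected_components(threshold_graph(seqs, threshold=0.25))

# One new sequence that lies within the threshold of both A and C.
seqs["X"] = "CCAAAAAAAA"
after = connected_components(threshold_graph(seqs, threshold=0.25))

print(len(before), len(after))  # prints "2 1"
```

Under the one-to-one mapping of clusters to connected components, adding the single sequence `X` merges two clusters into one, even though no pair among the original four sequences became any closer.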
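To make the idea of partitioning a connected component into communities concrete, the following is a minimal sketch of Markov clustering on an unweighted graph. It is a simplified, pure-Python rendering of the general expansion and inflation scheme, not the authors' implementation; the graph (two triangles joined by one bridging edge), the inflation value of 2.0, the added self-loops, and the convergence tolerance are all assumptions chosen for illustration.

```python
def normalize_columns(M):
    """Rescale each column to sum to one (column-stochastic matrix)."""
    n = len(M)
    sums = [sum(M[i][j] for i in range(n)) for j in range(n)]
    return [[M[i][j] / sums[j] for j in range(n)] for i in range(n)]


def markov_clustering(adj, inflation=2.0, max_iter=100, tol=1e-9):
    """Partition a graph, given as adjacency sets, by alternating expansion and inflation."""
    nodes = sorted(adj)
    n = len(nodes)
    idx = {v: i for i, v in enumerate(nodes)}

    # Adjacency matrix with self-loops (an assumption commonly made for this method).
    M = [[0.0] * n for _ in range(n)]
    for v, neighbours in adj.items():
        M[idx[v]][idx[v]] = 1.0
        for w in neighbours:
            M[idx[v]][idx[w]] = 1.0
            M[idx[w]][idx[v]] = 1.0
    M = normalize_columns(M)

    for _ in range(max_iter):
        # Expansion: two steps of the random walk (matrix squared).
        E = [[sum(M[i][k] * M[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
        # Inflation: raise entries to a power and renormalize, favouring strong flows.
        new = normalize_columns([[x ** inflation for x in row] for row in E])
        delta = max(abs(new[i][j] - M[i][j]) for i in range(n) for j in range(n))
        M = new
        if delta < tol:
            break

    # Rows that retain mass belong to attractor nodes; the columns with
    # non-negligible entries in such a row are the members of that cluster.
    clusters = set()
    for i in range(n):
        members = frozenset(nodes[j] for j in range(n) if M[i][j] > 1e-6)
        if members:
            clusters.add(members)
    return clusters


# Hypothetical graph: one connected component made of two triangles
# (a, b, c) and (d, e, f) joined by the single bridging edge c-d.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e", "f"},
    "e": {"d", "f"},
    "f": {"d", "e"},
}
clusters = markov_clustering(adj)
print(sorted(sorted(c) for c in clusters))
```

In this toy case the six nodes form a single connected component, yet the procedure separates the two densely connected triangles. Larger inflation values generally yield finer partitions, so the parameter would need to be tuned for real data.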