Weaver Steven, Dávila-Conn Vanessa, Ji Daniel, Verdonk Hannah, Ávila-Ríos Santiago, Leigh Brown Andrew J, Wertheim Joel O, Kosakovsky Pond Sergei L
Center for Viral Evolution, Temple University, Philadelphia, PA, USA.
Center for Research in Infectious Diseases, National Institute of Respiratory Diseases, Mexico City, Mexico.
bioRxiv. 2024 Mar 14:2024.03.11.584522. doi: 10.1101/2024.03.11.584522.
Molecular surveillance of viral pathogens and inference of transmission networks from genomic data play an increasingly important role in public health efforts, especially for HIV-1. For many methods, the genetic distance threshold used to connect sequences in the transmission network is a key parameter that shapes the properties of the inferred networks. A distance threshold that is too high can produce a network with many spurious links, making it difficult to interpret. Conversely, a distance threshold that is too low can produce a network with too few links, which may fail to capture clusters of public health concern. Published research using the HIV-TRACE software package frequently uses the default threshold of 0.015 substitutions/site for HIV pol gene sequences, but in many cases investigators heuristically select other thresholds to better capture the underlying dynamics of the epidemic they are studying. Here, we present a general heuristic scoring approach for tuning a distance threshold adaptively, which seeks to prevent the formation of giant clusters. We prioritize the ratio of the sizes of the largest and second-largest clusters while maximizing the number of clusters present in the network. We apply our scoring heuristic to outbreaks with different characteristics, such as regional or temporal variability, and demonstrate that the distance threshold suggested by the scoring mechanism helps identify clusters exhibiting risk factors that would otherwise have been more difficult to identify. For example, while we found that a distance threshold of 0.015 substitutions/site is typical for US-like epidemics, recent outbreaks such as that of the CRF07_BC subtype among men who have sex with men (MSM) in China were found to have a lower optimal threshold of 0.005 substitutions/site, which better captures the transition from injection drug use (IDU) to MSM as the primary risk factor.
Alternatively, in communities surrounding Lake Victoria in Uganda, where there has been sustained heterosexual transmission for many years, we found that a larger distance threshold is necessary to capture a population with more diverse risk factors that was sparsely sampled over a longer period of time. Such identification may allow for more informed intervention by the relevant public health officials.
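The idea of sweeping candidate thresholds and comparing the resulting cluster structure can be illustrated with a minimal Python sketch. This is not the scoring function from the paper or from HIV-TRACE, since the abstract does not specify it. The function names, the toy distances, the restriction to clusters of at least two sequences, and the `max_ratio` cutoff are all assumptions made for illustration. The sketch links sequences whose pairwise distance is at or below a threshold, records the number of clusters and the ratio of the largest to the second-largest cluster, and then picks the threshold with the most clusters among those whose ratio stays under the cutoff.

```python
from collections import Counter


def cluster_sizes(n_nodes, distances, threshold):
    """Sizes (descending) of connected components with >= 2 nodes when
    every pair with distance <= threshold is linked (union-find)."""
    parent = list(range(n_nodes))

    def find(x):
        # Walk to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (i, j), d in distances.items():
        if d <= threshold:
            ri, rj = find(i), find(j)
            if ri != rj:
                parent[ri] = rj

    counts = Counter(find(x) for x in range(n_nodes))
    return sorted((s for s in counts.values() if s >= 2), reverse=True)


def summarize(n_nodes, distances, thresholds):
    """One row per threshold: cluster count and largest/second-largest ratio."""
    rows = []
    for t in thresholds:
        sizes = cluster_sizes(n_nodes, distances, t)
        # With fewer than two clusters the ratio is undefined; treat it as
        # infinite so that a single giant cluster is never preferred.
        ratio = sizes[0] / sizes[1] if len(sizes) >= 2 else float("inf")
        rows.append({"threshold": t, "n_clusters": len(sizes), "ratio": ratio})
    return rows


def pick_threshold(rows, max_ratio=3.0):
    """Hypothetical selection rule (an assumption, not the paper's score):
    most clusters among thresholds whose ratio is <= max_ratio;
    ties go to the smaller threshold."""
    ok = [r for r in rows if r["ratio"] <= max_ratio]
    if not ok:
        return None
    return max(ok, key=lambda r: (r["n_clusters"], -r["threshold"]))["threshold"]


# Toy pairwise distances (substitutions/site) among six made-up sequences.
toy_distances = {
    (0, 1): 0.004,
    (2, 3): 0.004,
    (4, 5): 0.008,
    (1, 2): 0.014,
    (3, 4): 0.020,
}
toy_thresholds = [0.005, 0.010, 0.015, 0.020]

if __name__ == "__main__":
    for row in summarize(6, toy_distances, toy_thresholds):
        print(row)
    print("selected:", pick_threshold(summarize(6, toy_distances, toy_thresholds)))
```

On this toy input, raising the threshold first adds clusters, then merges them into one dominant cluster, which is the behavior the heuristic described above is meant to guard against. A real analysis would take distances from an alignment (for example, the pairwise output of a tool such as HIV-TRACE) rather than a hand-written dictionary.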