Nanduri Sravani, Black Allison, Bedford Trevor, Huddleston John
Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA, United States.
Vaccine and Infectious Disease Division, Fred Hutchinson Cancer Center, Seattle, WA, United States.
Virus Evol. 2024 Nov 14;10(1):veae087. doi: 10.1093/ve/veae087. eCollection 2024.
Public health researchers and practitioners commonly infer phylogenies from viral genome sequences to understand transmission dynamics and identify clusters of genetically related samples. However, viruses that reassort or recombine violate phylogenetic assumptions and require more sophisticated methods. Even when phylogenies are appropriate, they can be unnecessary or difficult to interpret without specialist knowledge. For example, pairwise distances between sequences can be enough to identify clusters of related samples or assign new samples to existing phylogenetic clusters. In this work, we tested whether dimensionality reduction methods could capture known genetic groups within two human pathogenic viruses that cause substantial human morbidity and mortality and frequently reassort or recombine, respectively: seasonal influenza A/H3N2 and SARS-CoV-2. We applied principal component analysis, multidimensional scaling (MDS), t-distributed stochastic neighbor embedding (t-SNE), and uniform manifold approximation and projection to sequences with well-defined phylogenetic clades and either reassortment (H3N2) or recombination (SARS-CoV-2). For each low-dimensional embedding of sequences, we calculated the correlation between pairwise genetic distances and pairwise Euclidean distances in the embedding, and applied a hierarchical clustering method to identify clusters in the embedding. We measured the accuracy of clusters compared to previously defined phylogenetic clades, reassortment clusters, or recombinant lineages. We found that MDS embeddings accurately represented pairwise genetic distances, including the intermediate placement of recombinant SARS-CoV-2 lineages between parental lineages. Clusters from t-SNE embeddings accurately recapitulated known phylogenetic clades, H3N2 reassortment groups, and SARS-CoV-2 recombinant lineages.
We show that simple statistical methods without a biological model can accurately represent known genetic relationships for relevant human pathogenic viruses. Our open-source implementation of these methods for analysis of viral genome sequences can be easily applied when phylogenetic methods are either unnecessary or inappropriate.
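The distance-correlation check described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes Hamming distance between aligned sequences as the genetic distance and Pearson's r as the correlation measure, neither of which the abstract specifies. The function names, sequences, and hand-made embedding coordinates are all hypothetical, and the embedding is supplied directly rather than computed by MDS or t-SNE.

```python
from itertools import combinations
from math import dist, sqrt


def hamming(a, b):
    """Count positions where two aligned sequences differ (assumed genetic distance)."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def genetic_vs_embedding_correlation(sequences, embedding):
    """Correlate pairwise genetic distances with pairwise Euclidean distances.

    sequences: dict mapping sample name -> aligned sequence string
    embedding: dict mapping sample name -> tuple of embedding coordinates
    """
    genetic, euclidean = [], []
    for a, b in combinations(sorted(sequences), 2):
        genetic.append(hamming(sequences[a], sequences[b]))
        euclidean.append(dist(embedding[a], embedding[b]))
    return pearson(genetic, euclidean)


# Hypothetical toy data: four aligned sequences and a hand-made 2-D embedding
# whose Euclidean distances equal the Hamming distances exactly.
toy_sequences = {"s1": "AAAA", "s2": "AAAT", "s3": "AATT", "s4": "TTTT"}
toy_embedding = {
    "s1": (0.0, 0.0),
    "s2": (1.0, 0.0),
    "s3": (2.0, 0.0),
    "s4": (4.0, 0.0),
}

r = genetic_vs_embedding_correlation(toy_sequences, toy_embedding)
print(round(r, 3))
```

Because the toy embedding preserves every pairwise distance, the correlation equals 1 up to floating-point rounding. An embedding that distorts distances would give a lower value, which is the property the abstract uses to compare methods.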