Department of Mathematics, Michigan State University, MI, 48824, USA.
Department of Mathematics, Statistics, and Computer Science, University of Illinois at Chicago, Chicago, IL, 60607, USA.
Comput Biol Med. 2021 Apr;131:104264. doi: 10.1016/j.compbiomed.2021.104264. Epub 2021 Feb 22.
Coronavirus disease 2019 (COVID-19), caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), has had a devastating worldwide effect. Understanding the evolution and transmission of SARS-CoV-2 is of paramount importance for controlling, combating and preventing COVID-19. Due to the rapid growth in both the number of SARS-CoV-2 genome sequences and the number of unique mutations, the phylogenetic analysis of SARS-CoV-2 genome isolates faces an emerging large-data challenge. We introduce a dimension-reduced K-means clustering strategy to tackle this challenge. We examine the performance and effectiveness of three dimension-reduction algorithms: principal component analysis (PCA), t-distributed stochastic neighbor embedding (t-SNE), and uniform manifold approximation and projection (UMAP). Using four benchmark datasets, we found that UMAP is the best-suited technique due to its stable, reliable, and efficient performance, its ability to improve clustering accuracy, especially for large Jaccard distance-based datasets, and its superior clustering visualization. UMAP-assisted K-means clustering enables us to shed light on increasingly large datasets from SARS-CoV-2 genome isolates.
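To make the Jaccard distance-based K-means idea concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes each isolate is represented by the set of genome positions at which it carries a mutation, and that each isolate's row of the pairwise Jaccard distance matrix serves as its feature vector. The function names, the toy mutation sets, and the farthest-point initialization are all hypothetical choices made for illustration. The dimension-reduction step (PCA, t-SNE, or UMAP) is deliberately left out because it needs a third-party library; in a full pipeline it would sit between `distance_matrix` and `kmeans`.

```python
from typing import List, Sequence, Set


def jaccard_distance(a: Set[int], b: Set[int]) -> float:
    """Return 1 - |A & B| / |A | B|; two empty sets are treated as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def distance_matrix(samples: Sequence[Set[int]]) -> List[List[float]]:
    """Pairwise Jaccard distances; row i is used as the feature vector of sample i."""
    return [[jaccard_distance(a, b) for b in samples] for a in samples]


def sq_dist(p: Sequence[float], q: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(p, q))


def kmeans(points: Sequence[Sequence[float]], k: int, iters: int = 100) -> List[int]:
    """Plain Lloyd's K-means with a deterministic farthest-point initialization
    (an illustrative choice, not taken from the paper)."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        # Next centroid: the point farthest from its nearest existing centroid.
        far = max(range(len(points)),
                  key=lambda i: min(sq_dist(points[i], c) for c in centroids))
        centroids.append(list(points[far]))

    labels = [-1] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy mutation sets (hypothetical positions, not real data).
samples = [{1, 2, 3}, {1, 2, 3, 4}, {10, 11}, {10, 11, 12}]
dm = distance_matrix(samples)
labels = kmeans(dm, k=2)
print(labels)  # [0, 0, 1, 1] for this toy input
```

The first two toy isolates share three of four mutations and the last two share two of three, so they fall into separate clusters. With real data the distance matrix has one column per isolate, which is the situation where reducing dimensionality before K-means becomes relevant.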