Hozumi Yuta, Wang Rui, Yin Changchuan, Wei Guo-Wei
ArXiv. 2020 Dec 30:arXiv:2012.15268v1.
Coronavirus disease 2019 (COVID-19), caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), has had a devastating worldwide effect. Understanding the evolution and transmission of SARS-CoV-2 is of paramount importance for controlling, combating, and preventing COVID-19. Due to the rapid growth of both the number of SARS-CoV-2 genome sequences and the number of unique mutations, the phylogenetic analysis of SARS-CoV-2 genome isolates faces an emerging large-data challenge. We introduce a dimension-reduced $k$-means clustering strategy to tackle this challenge. We examine the performance and effectiveness of three dimension-reduction algorithms: principal component analysis (PCA), t-distributed stochastic neighbor embedding (t-SNE), and uniform manifold approximation and projection (UMAP). Using four benchmark datasets, we find that UMAP is the best-suited technique because of its stable, reliable, and efficient performance, its ability to improve clustering accuracy, especially for large Jaccard distance-based datasets, and its superior clustering visualization. UMAP-assisted $k$-means clustering enables us to shed light on increasingly large datasets from SARS-CoV-2 genome isolates.
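The dimension-reduced $k$-means strategy named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several assumptions: each isolate is encoded as a 0/1 vector over mutation sites, each isolate's feature vector is its row of the pairwise Jaccard distance matrix, PCA (computed through an SVD) stands in for UMAP so that only NumPy is needed, and $k$-means uses a deterministic farthest-first seeding chosen purely for reproducibility. All function names and the toy data are hypothetical.

```python
import numpy as np


def jaccard_distance_matrix(binary_matrix):
    """Pairwise Jaccard distances between the rows of a 0/1 matrix."""
    X = np.asarray(binary_matrix, dtype=np.int64)
    intersection = X @ X.T                     # shared mutations for every pair
    sizes = X.sum(axis=1)
    union = sizes[:, None] + sizes[None, :] - intersection
    # Assumption: two isolates with no mutations at all count as identical.
    similarity = np.ones(union.shape, dtype=float)
    np.divide(intersection, union, out=similarity, where=union > 0)
    return 1.0 - similarity


def pca_reduce(features, n_components):
    """Project the rows of `features` onto the leading principal components."""
    centered = features - features.mean(axis=0)
    U, S, _ = np.linalg.svd(centered, full_matrices=False)
    return U[:, :n_components] * S[:n_components]


def farthest_first_centers(points, k):
    """Deterministic seeding: start at point 0, then repeatedly take the
    point farthest from all centers chosen so far (an assumption of this sketch)."""
    centers = [points[0]]
    for _ in range(1, k):
        nearest = np.min(
            [np.linalg.norm(points - c, axis=1) for c in centers], axis=0
        )
        centers.append(points[nearest.argmax()])
    return np.array(centers)


def kmeans(points, k, n_iter=100):
    """Plain Lloyd iterations; returns one cluster label per point."""
    centers = farthest_first_centers(points, k)
    labels = np.zeros(len(points), dtype=int)
    for _ in range(n_iter):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels


def cluster_isolates(mutation_matrix, k, n_components=2):
    """Jaccard distance features -> dimension reduction -> k-means."""
    features = jaccard_distance_matrix(mutation_matrix)
    embedding = pca_reduce(features, n_components)
    return kmeans(embedding, k)


# Hypothetical toy data: rows are isolates, columns are mutation sites.
toy = np.array([
    [1, 1, 1, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 1, 1, 1, 1],
    [0, 0, 0, 0, 0, 1, 1, 0],
])
labels = cluster_isolates(toy, k=2)
```

On this toy input the first three isolates share no mutations with the last three, so the two groups are cleanly separated after reduction. Swapping `pca_reduce` for a t-SNE or UMAP embedding from a third-party package would keep the rest of the pipeline unchanged.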