UMAP-assisted K-means clustering of large-scale SARS-CoV-2 mutation datasets.

Author affiliations

Department of Mathematics, Michigan State University, MI, 48824, USA.

Department of Mathematics, Statistics, and Computer Science, University of Illinois at Chicago, Chicago, IL, 60607, USA.

Publication information

Comput Biol Med. 2021 Apr;131:104264. doi: 10.1016/j.compbiomed.2021.104264. Epub 2021 Feb 22.

Abstract

Coronavirus disease 2019 (COVID-19), caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), has had a devastating worldwide effect. Understanding the evolution and transmission of SARS-CoV-2 is of paramount importance for controlling, combating, and preventing COVID-19. Due to the rapid growth in both the number of SARS-CoV-2 genome sequences and the number of unique mutations, the phylogenetic analysis of SARS-CoV-2 genome isolates faces an emergent large-data challenge. We introduce a dimension-reduced K-means clustering strategy to tackle this challenge. We examine the performance and effectiveness of three dimension-reduction algorithms: principal component analysis (PCA), t-distributed stochastic neighbor embedding (t-SNE), and uniform manifold approximation and projection (UMAP). Using four benchmark datasets, we found that UMAP is the best-suited technique due to its stable, reliable, and efficient performance, its ability to improve clustering accuracy, especially for large Jaccard distance-based datasets, and its superior clustering visualization. UMAP-assisted K-means clustering enables us to shed light on increasingly large datasets from SARS-CoV-2 genome isolates.
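
The abstract refers to Jaccard distance-based datasets without spelling out how they are built. The following is a minimal sketch of one plausible construction, assuming each isolate is represented by the set of genome positions at which it carries a mutation. The function names and the toy data are hypothetical and are not taken from the paper; the sketch covers only the distance representation, not the dimension-reduction or K-means steps.

```python
from typing import List, Set


def jaccard_distance(a: Set[int], b: Set[int]) -> float:
    """Return 1 - |a & b| / |a | b|; two empty sets count as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def jaccard_matrix(samples: List[Set[int]]) -> List[List[float]]:
    """Pairwise Jaccard distances between all samples.

    Row i can serve as a feature vector for sample i (an assumption
    about how such a dataset might be fed to later steps).
    """
    return [[jaccard_distance(s, t) for t in samples] for s in samples]


# Hypothetical toy data: each set lists mutated positions for one isolate.
toy_samples = [
    {241, 3037, 14408, 23403},
    {241, 3037, 14408, 23403, 28881},
    {8782, 28144},
    {8782, 28144, 11083},
]

distances = jaccard_matrix(toy_samples)
```

In a pipeline like the one the abstract describes, a matrix of this kind would presumably be passed to a dimension-reduction method (PCA, t-SNE, or UMAP) and the low-dimensional output clustered with K-means. Isolates that share most of their mutations end up with small mutual distances, while isolates with no mutations in common sit at the maximum distance of 1.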

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4576/7897976/07bd8a44f992/gr1_lrg.jpg
