UMAP-assisted K-means clustering of large-scale SARS-CoV-2 mutation datasets.

Affiliations

Department of Mathematics, Michigan State University, MI, 48824, USA.

Department of Mathematics, Statistics, and Computer Science, University of Illinois at Chicago, Chicago, IL, 60607, USA.

Publication information

Comput Biol Med. 2021 Apr;131:104264. doi: 10.1016/j.compbiomed.2021.104264. Epub 2021 Feb 22.

DOI:10.1016/j.compbiomed.2021.104264
PMID:33647832
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7897976/
Abstract

Coronavirus disease 2019 (COVID-19), caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), has had a devastating worldwide effect. Understanding the evolution and transmission of SARS-CoV-2 is of paramount importance for controlling, combating and preventing COVID-19. Due to the rapid growth in both the number of SARS-CoV-2 genome sequences and the number of unique mutations, the phylogenetic analysis of SARS-CoV-2 genome isolates faces an emergent large-data challenge. We introduce a dimension-reduced K-means clustering strategy to tackle this challenge. We examine the performance and effectiveness of three dimension-reduction algorithms: principal component analysis (PCA), t-distributed stochastic neighbor embedding (t-SNE), and uniform manifold approximation and projection (UMAP). Using four benchmark datasets, we found that UMAP is the best-suited technique due to its stable, reliable, and efficient performance, its ability to improve clustering accuracy, especially for large Jaccard distance-based datasets, and its superior clustering visualization. The UMAP-assisted K-means clustering enables us to shed light on increasingly large datasets from SARS-CoV-2 genome isolates.
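
The abstract does not spell out implementation details, so the following is only a minimal sketch of the general idea, not the authors' code. It assumes each genome isolate is described by a set of mutation positions, that each isolate is then represented by its row of pairwise Jaccard distances, and that K-means is run on those rows. The toy mutation sets, the function names (`jaccard_features`, `kmeans`, `farthest_point_init`), and the deterministic farthest-point initialization are all illustrative assumptions. The dimension-reduction step (PCA, t-SNE, or UMAP) is deliberately left out to keep the example dependency-free; in practice it would sit between the feature construction and K-means.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two sets: 1 - |A & B| / |A | B|."""
    union = a | b
    if not union:
        return 0.0  # two empty sets are treated as identical
    return 1.0 - len(a & b) / len(union)


def jaccard_features(samples):
    """Represent each sample by its row of Jaccard distances to all samples."""
    return [[jaccard_distance(s, t) for t in samples] for s in samples]


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(p, q))


def farthest_point_init(points, k):
    """Deterministic init (an assumption): start at points[0], then
    repeatedly add the point farthest from the centers chosen so far."""
    centers = [list(points[0])]
    while len(centers) < k:
        nxt = max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        centers.append(list(nxt))
    return centers


def kmeans(points, k, n_iter=100):
    """Plain Lloyd's K-means; returns one cluster label per point."""
    centers = farthest_point_init(points, k)
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Made-up toy data: each set holds hypothetical mutation positions of one isolate.
samples = [
    {1, 2, 3, 4},
    {1, 2, 3, 4, 5},
    {1, 2, 3, 4, 6},
    {10, 11},
    {10, 11, 12},
    {10, 11, 12, 13},
]

features = jaccard_features(samples)
labels = kmeans(features, k=2)
print(labels)  # the first three isolates share one label, the last three the other
```

On real data the Jaccard feature matrix grows with the square of the number of isolates, which is the setting where the abstract reports that reducing the dimension first (with UMAP in particular) helps K-means.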


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4576/7897976/07bd8a44f992/gr1_lrg.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4576/7897976/37e3c222595a/gr2_lrg.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4576/7897976/28a7e5d835f5/gr3_lrg.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4576/7897976/aa852af13542/gr4_lrg.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4576/7897976/4924c8ca53bb/gr5_lrg.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4576/7897976/852f7953032b/gr6_lrg.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4576/7897976/4aa7b3a68887/gr7_lrg.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4576/7897976/265897edbc03/gr8_lrg.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4576/7897976/0e22146ded19/gr9_lrg.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4576/7897976/2c343b34134c/gr10_lrg.jpg
