Unsupervised cluster analysis of SARS-CoV-2 genomes reflects its geographic progression and identifies distinct genetic subgroups of SARS-CoV-2 virus.

Affiliations

Department of Biostatistics, T.H. Chan School of Public Health, Harvard University, Boston, Massachusetts, USA.

Department of Medical Consilience, Graduate School, Dankook University, Yongin-si, South Korea.

Publication information

Genet Epidemiol. 2021 Apr;45(3):316-323. doi: 10.1002/gepi.22373. Epub 2021 Jan 8.

DOI:10.1002/gepi.22373

PMID:33415739

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8005425/

Abstract

Over 10,000 viral genome sequences of the SARS-CoV-2 virus have been made readily available during the ongoing coronavirus pandemic since the initial genome sequence of the virus was released on the open access Virological website (http://virological.org/) early on January 11. We utilize the published data on the single-stranded RNAs of 11,132 SARS-CoV-2 patients in the GISAID database, which contains fully or partially sequenced SARS-CoV-2 samples from laboratories around the world. Among the many important research questions currently being investigated, one pertains to the genetic characterization/classification of the virus. We analyze the nucleotide sequencing data and geographic information of a subset of 7,640 SARS-CoV-2 patients without missing entries in the GISAID database. Instead of modeling the mutation rate, applying phylogenetic tree approaches, and so forth, we utilize a model-free clustering approach that compares the viruses at a genome-wide level. We apply principal component analysis to a similarity matrix that compares all pairs of these SARS-CoV-2 nucleotide sequences at all loci simultaneously, using the Jaccard index. Our analysis of the SARS-CoV-2 genome data illustrates the geographic and chronological progression of the virus, starting from the first cases that were observed in China to the current wave of cases in Europe and North America. This is in line with a phylogenetic analysis, which we use to contrast our results. We also observe that, based on their sequence data, the SARS-CoV-2 viruses cluster in distinct genetic subgroups. It is the subject of ongoing research to examine whether the genetic subgroups could be related to disease outcome, and what their potential implications for vaccine development are.
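The clustering step described in the abstract (a Jaccard similarity matrix over all sequence pairs, followed by principal component analysis) can be illustrated with a minimal sketch. This is not the authors' code. It assumes each genome has already been encoded as a binary vector marking the loci where it differs from a reference sequence, and it assumes the similarity matrix is double-centered before the eigendecomposition; neither choice is stated in the abstract. The function names `jaccard_similarity_matrix` and `principal_components` and the toy data are hypothetical.

```python
import numpy as np


def jaccard_similarity_matrix(X):
    """Pairwise Jaccard index between the rows of a binary (0/1) matrix X.

    Rows are samples, columns are loci; a 1 marks a mismatch against the
    reference (an assumed encoding, not taken from the abstract).
    """
    X = np.asarray(X, dtype=np.int64)
    intersection = X @ X.T                      # shared mismatches per pair
    row_sums = X.sum(axis=1)
    union = row_sums[:, None] + row_sums[None, :] - intersection
    # Two rows with no mismatches have an empty union; this sketch treats
    # them as identical (similarity 1), which is an assumption.
    with np.errstate(divide="ignore", invalid="ignore"):
        sim = np.where(union > 0, intersection / union, 1.0)
    return sim


def principal_components(sim, k=2):
    """Leading k eigenvalues and eigenvectors of the double-centered matrix."""
    n = sim.shape[0]
    centering = np.eye(n) - np.full((n, n), 1.0 / n)
    centered = centering @ sim @ centering
    eigvals, eigvecs = np.linalg.eigh(centered)  # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:k]        # keep the k largest
    return eigvals[order], eigvecs[:, order]


# Toy data: samples 0-1 share one set of mismatches, samples 2-3 another.
toy = np.array([
    [1, 1, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 1, 1, 1],
])
similarity = jaccard_similarity_matrix(toy)
values, vectors = principal_components(similarity, k=2)
print(np.round(similarity, 3))
print(np.round(vectors[:, 0], 3))  # first component splits the two groups
```

In this toy example the first component assigns one sign to samples 0 and 1 and the opposite sign to samples 2 and 3, which is the kind of separation that would show up as clusters when the leading components are plotted against each other.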

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4758/8005425/90d15d493003/nihms-1650988-f0001.jpg

Similar articles

Unsupervised cluster analysis of SARS-CoV-2 genomes reflects its geographic progression and identifies distinct genetic subgroups of SARS-CoV-2 virus.

bioRxiv. 2020 Nov 20:2020.05.05.079061. doi: 10.1101/2020.05.05.079061.

Evolutionary and Phylogenetic Dynamics of SARS-CoV-2 Variants: A Genetic Comparative Study of Taiyuan and Wuhan Cities of China.

Viruses. 2024 Jun 3;16(6):907. doi: 10.3390/v16060907.

Phylogenetic reconstruction of the initial stages of the spread of the SARS-CoV-2 virus in the Eurasian and American continents by analyzing genomic data.

Virus Res. 2021 Nov;305:198551. doi: 10.1016/j.virusres.2021.198551. Epub 2021 Aug 26.

Genetic grouping of SARS-CoV-2 coronavirus sequences using informative subtype markers for pandemic spread visualization.

PLoS Comput Biol. 2020 Sep 17;16(9):e1008269. doi: 10.1371/journal.pcbi.1008269. eCollection 2020 Sep.

Unsupervised cluster analysis of SARS-CoV-2 genomes indicates that recent (June 2020) cases in Beijing are from a genetic subgroup that consists of mostly European and South(east) Asian samples, of which the latter are the most recent.

bioRxiv. 2020 Jun 30:2020.06.22.165936. doi: 10.1101/2020.06.22.165936.

Evidence of increasing diversification of emerging Severe Acute Respiratory Syndrome Coronavirus 2 strains.

J Med Virol. 2020 Oct;92(10):2165-2172. doi: 10.1002/jmv.26018. Epub 2020 Aug 2.

Identification of Epidemiological Traits by Analysis of SARS-CoV-2 Sequences.

Viruses. 2021 Apr 27;13(5):764. doi: 10.3390/v13050764.

Large-scale sequencing of SARS-CoV-2 genomes from one region allows detailed epidemiology and enables local outbreak management.

Microb Genom. 2021 Jun;7(6). doi: 10.1099/mgen.0.000589.

Taxonium, a web-based tool for exploring large phylogenetic trees.

Elife. 2022 Nov 15;11:e82392. doi: 10.7554/eLife.82392.

Cited by

Unsupervised outlier detection applied to SARS-CoV-2 nucleotide sequences can identify sequences of common variants and other variants of interest.

BMC Bioinformatics. 2022 Dec 19;23(1):547. doi: 10.1186/s12859-022-05105-y.

Genomic and structural mechanistic insight to reveal the differential infectivity of omicron and other variants of concern.

Comput Biol Med. 2022 Nov;150:106129. doi: 10.1016/j.compbiomed.2022.106129. Epub 2022 Sep 22.

COVID-19: Integrating genomic and epidemiological data to inform public health interventions and policy in Tasmania, Australia.

Western Pac Surveill Response J. 2021 Dec 22;12(4):1-9. doi: 10.5365/wpsar.2021.12.4.878. eCollection 2021 Oct-Dec.

Genome-wide association analysis of COVID-19 mortality risk in SARS-CoV-2 genomes identifies mutation in the SARS-CoV-2 spike protein that colocalizes with P.1 of the Brazilian strain.

Genet Epidemiol. 2021 Oct;45(7):685-693. doi: 10.1002/gepi.22421. Epub 2021 Jun 22.

Genome-wide analysis of 10664 SARS-CoV-2 genomes to identify virus strains in 73 countries based on single nucleotide polymorphism.

Virus Res. 2021 Jun;298:198401. doi: 10.1016/j.virusres.2021.198401. Epub 2021 Mar 26.

References

Accommodating individual travel history and unsampled diversity in Bayesian phylogeographic inference of SARS-CoV-2.

Nat Commun. 2020 Oct 9;11(1):5110. doi: 10.1038/s41467-020-18877-9.

locStra: Fast analysis of regional/global stratification in whole-genome sequencing studies.

Genet Epidemiol. 2021 Feb;45(1):82-98. doi: 10.1002/gepi.22356. Epub 2020 Sep 14.

Genotype and phenotype of COVID-19: Their roles in pathogenesis.

J Microbiol Immunol Infect. 2021 Apr;54(2):159-163. doi: 10.1016/j.jmii.2020.03.022. Epub 2020 Mar 31.

COVID-19 in a Long-Term Care Facility - King County, Washington, February 27-March 9, 2020.

MMWR Morb Mortal Wkly Rep. 2020 Mar 27;69(12):339-342. doi: 10.15585/mmwr.mm6912e1.

Host susceptibility to severe COVID-19 and establishment of a host risk score: findings of 487 cases outside Wuhan.

Crit Care. 2020 Mar 18;24(1):108. doi: 10.1186/s13054-020-2833-7.

Clinical course and risk factors for mortality of adult inpatients with COVID-19 in Wuhan, China: a retrospective cohort study.

Lancet. 2020 Mar 28;395(10229):1054-1062. doi: 10.1016/S0140-6736(20)30566-3. Epub 2020 Mar 11.

Data, disease and diplomacy: GISAID's innovative contribution to global health.

Glob Chall. 2017 Jan 10;1(1):33-46. doi: 10.1002/gch2.1018. eCollection 2017 Jan.

GISAID: Global initiative on sharing all influenza data - from vision to reality.

Euro Surveill. 2017 Mar 30;22(13). doi: 10.2807/1560-7917.ES.2017.22.13.30494.

Identification of genetic outliers due to sub-structure and cryptic relationships.

Bioinformatics. 2017 Jul 1;33(13):1972-1979. doi: 10.1093/bioinformatics/btx109.

Utilizing the Jaccard index to reveal population stratification in sequencing data: a simulation study and an application to the 1000 Genomes Project.

Bioinformatics. 2016 May 1;32(9):1366-72. doi: 10.1093/bioinformatics/btv752. Epub 2015 Dec 31.
