Department of Biostatistics, T.H. Chan School of Public Health, Harvard University, Boston, Massachusetts, USA.
Department of Medical Consilience, Graduate School, Dankook University, Yongin-si, South Korea.
Genet Epidemiol. 2021 Apr;45(3):316-323. doi: 10.1002/gepi.22373. Epub 2021 Jan 8.
Over 10,000 viral genome sequences of the SARS-CoV-2 virus have been made readily available during the ongoing coronavirus pandemic, since the initial genome sequence of the virus was released on the open-access Virological website (http://virological.org/) on January 11, 2020. We utilize the published data on the single-stranded RNAs of 11,132 SARS-CoV-2 patients in the GISAID database, which contains fully or partially sequenced SARS-CoV-2 samples from laboratories around the world. Among the many important research questions currently being investigated, one pertains to the genetic characterization and classification of the virus. We analyze the nucleotide sequence data and geographic information for the subset of 7,640 SARS-CoV-2 patients in the GISAID database without missing entries. Instead of modeling the mutation rate or applying phylogenetic tree approaches, we utilize a model-free clustering approach that compares the viruses at a genome-wide level. We apply principal component analysis to a similarity matrix that compares all pairs of these SARS-CoV-2 nucleotide sequences at all loci simultaneously, using the Jaccard index. Our analysis of the SARS-CoV-2 genome data illustrates the geographic and chronological progression of the virus, from the first cases observed in China to the current wave of cases in Europe and North America. These findings are in line with a phylogenetic analysis that we use as a point of comparison for our results. We also observe that, based on their sequence data, the SARS-CoV-2 viruses cluster in distinct genetic subgroups. Whether these genetic subgroups are related to disease outcome, and what the implications for vaccine development might be, is the subject of ongoing research.
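The core computation described above, principal component analysis of a pairwise Jaccard similarity matrix, can be sketched in a few lines. This is a minimal illustration and not the authors' pipeline: it assumes each genome has already been encoded as a binary vector marking the loci where it differs from a reference sequence (the abstract does not specify the encoding), and the function names `jaccard_similarity_matrix` and `top_components` as well as the toy data are hypothetical.

```python
import numpy as np

def jaccard_similarity_matrix(X):
    """Pairwise Jaccard index between rows of a binary (samples x loci) matrix."""
    X = np.asarray(X, dtype=bool).astype(int)
    intersection = X @ X.T                      # loci where both samples carry a variant
    counts = X.sum(axis=1)                      # variants per sample
    union = counts[:, None] + counts[None, :] - intersection
    # Assumption: two samples with no variants at all are treated as identical (J = 1).
    J = np.ones(intersection.shape, dtype=float)
    np.divide(intersection, union, out=J, where=union > 0)
    return J

def top_components(S, k=2):
    """Leading k eigenpairs of the double-centered similarity matrix S."""
    n = S.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n         # centering matrix
    C = H @ S @ H
    eigvals, eigvecs = np.linalg.eigh(C)        # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:k]       # keep the k largest
    return eigvals[order], eigvecs[:, order]

# Hypothetical toy data: 6 samples x 5 loci, 1 = differs from the reference.
toy = np.array([
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 1, 1, 1],
])

J = jaccard_similarity_matrix(toy)
vals, pcs = top_components(J, k=2)
print(np.round(J, 3))
print(np.round(pcs, 3))
```

In this toy example the first principal component takes one sign for the first three samples and the opposite sign for the last three, which is the kind of separation into genetic subgroups that the abstract describes. For thousands of genomes, the dense eigendecomposition would likely be replaced by a truncated solver.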