Li Yawei, Liu Qingyun, Zeng Zexian, Luo Yuan
Department of Preventive Medicine, Northwestern University, Feinberg School of Medicine, Chicago, IL 60611, USA.
Department of Immunology and Infectious Diseases, Harvard T. H. Chan School of Public Health, Boston, MA 02115, USA.
bioRxiv. 2021 Nov 24:2020.09.04.283358. doi: 10.1101/2020.09.04.283358.
Identifying the population structure of the newly emerged coronavirus SARS-CoV-2 has significant potential to inform public health management and diagnosis. As SARS-CoV-2 sequencing data accrue, grouping the sequences into clusters is important for organizing the landscape of the virus's population structure. Because prior information on the newly emerged coronavirus was limited, we used four different clustering algorithms to group 16,873 SARS-CoV-2 strains, which enabled automatic identification of the spatial structure of SARS-CoV-2. Using mutation profiles as input features, we identified six distinct genomic clusters. Comparison of the clustering results showed that the four algorithms produced highly consistent results, but the state-of-the-art unsupervised deep learning clustering algorithm performed best, producing the smallest intra-cluster pairwise genetic distances. The proportions of the six clusters varied across continents, revealing specific geographical distributions. In particular, Oceania was the only continent on which the strains were dispersed across all six clusters. In summary, this study provides a concrete framework for using clustering methods to study the global population structure of SARS-CoV-2. In addition, clustering methods can be used in future studies of the variant population structures of these fast-growing viruses in specific regions.
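The abstract ranks the clustering algorithms by intra-cluster pairwise genetic distance, with smaller values indicating tighter clusters. A minimal sketch of that kind of comparison follows. It assumes binary mutation profiles (1 = mutation present at a site) and uses Hamming distance as a stand-in for genetic distance; the exact distance definition is not stated in this abstract, and the function names and toy data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def hamming(a, b):
    """Number of sites at which two equal-length mutation profiles differ."""
    if len(a) != len(b):
        raise ValueError("profiles must have the same length")
    return sum(x != y for x, y in zip(a, b))


def mean_intra_cluster_distance(profiles, labels):
    """Average Hamming distance over all pairs of strains sharing a cluster."""
    clusters = defaultdict(list)
    for profile, label in zip(profiles, labels):
        clusters[label].append(profile)

    total, pairs = 0, 0
    for members in clusters.values():
        for a, b in combinations(members, 2):
            total += hamming(a, b)
            pairs += 1
    # With no same-cluster pairs (all singletons) the score is defined as 0.
    return total / pairs if pairs else 0.0


# Toy data (assumption, not from the study): 4 strains x 5 mutation sites.
profiles = [
    (1, 1, 0, 0, 0),
    (1, 1, 0, 0, 1),
    (0, 0, 1, 1, 0),
    (0, 0, 1, 1, 1),
]
labels_a = [0, 0, 1, 1]  # groups similar profiles together
labels_b = [0, 1, 0, 1]  # groups dissimilar profiles together

score_a = mean_intra_cluster_distance(profiles, labels_a)
score_b = mean_intra_cluster_distance(profiles, labels_b)
```

Under this metric the assignment with the lower score (here `labels_a`) forms the more compact clusters, which is the sense in which one algorithm can be said to outperform another.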