Transitional Artificial Intelligence Research Group, School of Mathematics and Statistics, UNSW Sydney, Sydney, Australia.
Department of Computer Science and Information Systems, Birla Institute of Technology and Science Pilani, Rajasthan, India.
PLoS One. 2023 May 18;18(5):e0285719. doi: 10.1371/journal.pone.0285719. eCollection 2023.
Due to the high mutation rate of the virus, the COVID-19 pandemic evolved rapidly. Certain variants of the virus, such as Delta and Omicron, emerged with altered viral properties that led to high transmission and death rates. These variants burdened medical systems worldwide and had a major impact on travel, productivity, and the world economy. Unsupervised machine learning methods can compress, characterize, and visualize unlabelled data. This paper presents a framework that uses unsupervised machine learning methods to discriminate between major COVID-19 variants and to visualize the associations among them based on their genome sequences. These methods combine selected dimensionality reduction and clustering techniques. The framework processes the RNA sequences by performing a k-mer analysis on the data, then visualizes and compares the results using selected dimensionality reduction methods: principal component analysis (PCA), t-distributed stochastic neighbour embedding (t-SNE), and uniform manifold approximation and projection (UMAP). Our framework also employs agglomerative hierarchical clustering, presented as dendrograms, to visualize the mutational differences among major variants of concern and the country-wise mutational differences for selected variants (Delta and Omicron). We find that the proposed framework can effectively distinguish between the major variants and has the potential to identify emerging variants in the future.
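To make the k-mer and clustering steps concrete, the following is a minimal sketch and not the authors' implementation. It assumes k = 3, normalized k-mer frequencies, Euclidean distance, and average linkage, none of which are specified in the abstract. The function names `kmer_profile` and `average_linkage` and the three toy sequences are hypothetical, chosen only for illustration.

```python
from collections import Counter
from itertools import product
from math import dist


def kmer_profile(seq, k=3, alphabet="ACGT"):
    """Return normalized k-mer frequencies in a fixed lexicographic order."""
    seq = seq.upper().replace("U", "T")  # treat RNA and DNA letters alike
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    # k-mers with ambiguous bases (e.g. N) are ignored in the total
    total = sum(counts[m] for m in kmers) or 1
    return [counts[m] / total for m in kmers]


def average_linkage(profiles):
    """Naive agglomerative clustering; returns the list of merges in order."""
    clusters = {name: [name] for name in profiles}
    merges = []
    while len(clusters) > 1:
        best = None
        keys = sorted(clusters)
        for i, a in enumerate(keys):
            for b in keys[i + 1:]:
                pairs = [(x, y) for x in clusters[a] for y in clusters[b]]
                d = sum(dist(profiles[x], profiles[y]) for x, y in pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        clusters[a + "+" + b] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, d))
    return merges


# Toy sequences, made up for illustration (not real SARS-CoV-2 genomes)
toy = {
    "s1": "ACGTACGTACGTACGTACGT",
    "s2": "ACGTACGTACGAACGTACGT",
    "s3": "TTTTGGGGCCCCAAAATTTT",
}
profiles = {name: kmer_profile(seq) for name, seq in toy.items()}
merges = average_linkage(profiles)
for a, b, d in merges:
    print(f"merge {a} and {b} at distance {d:.3f}")
```

The merge order is the information a dendrogram displays: sequences with similar k-mer composition join early, at small distances. For full genomes, the same profiles could be passed to library implementations of PCA, t-SNE, UMAP, and hierarchical clustering instead of this quadratic pure-Python loop.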