Department of Computer Science, Georgia State University, Atlanta, Georgia, USA.
J Comput Biol. 2023 Apr;30(4):432-445. doi: 10.1089/cmb.2022.0391. Epub 2023 Jan 19.
With the rapid spread of COVID-19 worldwide, viral genomic data are available on the order of millions of sequences in public databases such as GISAID. This creates a unique opportunity for analyses that support effective vaccine development for the current pandemic and help avoid or mitigate future pandemics. Every such viral sequence comes with the geographical location where it was collected, and the patterns that link viral variants to geographical location are an important part of this analysis. One major challenge that researchers face is processing such large, high-dimensional data to obtain useful insights as quickly as possible. Most existing methods face scalability issues when dealing with data of this magnitude. In this article, we propose an approach that first computes a numerical representation of the SARS-CoV-2 spike protein sequence using k-mers (substrings of length k) and then uses several machine learning models to classify the sequences by geographical location. We show that our proposed model significantly outperforms the baselines. We also show the importance of different amino acids in the spike sequences by computing the information gain with respect to the true class labels.
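The two computations named in the abstract, a k-mer based numerical representation and information gain against the class labels, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the 20-letter amino acid alphabet, the default k = 3, and the per-position form of information gain (which assumes aligned, equal-length sequences) are all assumptions made for illustration, since the abstract does not specify them.

```python
from collections import Counter
from itertools import product
from math import log2

# Assumption: the 20 standard amino acids; other symbols (e.g. "X") are skipped.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmer_vector(sequence, k=3, alphabet=AMINO_ACIDS):
    """Count every length-k substring (k-mer) of `sequence`.

    Returns a list of len(alphabet)**k counts, one slot per possible k-mer,
    ordered as itertools.product enumerates them over `alphabet`.
    """
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    vector = [0] * len(index)
    # Slide a window of width k over the sequence.
    for start in range(len(sequence) - k + 1):
        slot = index.get(sequence[start:start + k])
        if slot is not None:  # skip k-mers with out-of-alphabet characters
            vector[slot] += 1
    return vector


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(sequences, labels, position):
    """Information gain of the amino acid at `position` (0-based) w.r.t. labels.

    Assumes the sequences are aligned and at least `position + 1` long.
    """
    groups = {}
    for seq, label in zip(sequences, labels):
        groups.setdefault(seq[position], []).append(label)
    n = len(labels)
    # Entropy of the labels that remains after splitting on the amino acid.
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional
```

Each fixed-length vector returned by `kmer_vector` can be passed as a feature row to any standard classifier, with the collection location as the target label. With the defaults, the vector has 20^3 = 8000 entries regardless of sequence length, which is what makes sequences of different lengths comparable.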