van Zyl Daniel J, Dunaiski Marcel, Tegally Houriiyah, Baxter Cheryl, de Oliveira Tulio, Xavier Joicymara S
Centre for Epidemic Response and Innovation (CERI), School of Data Science and Computational Thinking, Stellenbosch University, Stellenbosch, South Africa.
Computer Science Division, Department of Mathematical Sciences, Faculty of Science, Stellenbosch University, Stellenbosch, South Africa.
bioRxiv. 2024 Dec 11:2024.12.10.627186. doi: 10.1101/2024.12.10.627186.
The rapid increase in nucleotide sequence data generated by next-generation sequencing (NGS) technologies demands efficient computational tools for sequence comparison. Alignment-based methods, such as BLAST, are computationally expensive when used for classification and increasingly struggle with the scale of contemporary datasets. This study evaluates alignment-free (AF) methods as scalable and rapid alternatives for viral sequence classification, focusing on identifying techniques that remain accurate and efficient when applied to extremely large datasets.
We used six established AF techniques to extract feature vectors from viral genomes, and then trained Random Forest classifiers on those vectors. Our primary dataset comprises 297,186 SARS-CoV-2 nucleotide sequences, categorized into 3,502 distinct lineages. We also validated our models on dengue and HIV sequences to demonstrate robustness across different viral datasets. Our AF classifiers achieved 97.8% accuracy on the SARS-CoV-2 test set, and 99.8% and 89.1% accuracy on the dengue and HIV test sets, respectively.
Despite the large number of classes, we show that word-based AF methods effectively represent viral sequences. Our study highlights the practical advantages of AF techniques, including significantly faster processing than alignment-based methods and the ability to classify sequences using modest computational resources.
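To illustrate what a word-based AF representation can look like, the following is a minimal sketch of k-mer frequency extraction. The abstract does not name the six techniques evaluated, so treating k-mer counting as a representative is an assumption, as are the function name `kmer_frequencies`, the default `k=3`, and the choice to skip windows containing ambiguous bases. Vectors of this kind could in principle be passed to a Random Forest classifier, but this sketch is not the authors' pipeline.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(sequence, k=3, alphabet="ACGT"):
    """Return a normalized k-mer frequency vector for one nucleotide sequence.

    Hypothetical helper for illustration only. The vector has
    len(alphabet) ** k entries, ordered lexicographically by k-mer.
    """
    sequence = sequence.upper()

    # Fixed vocabulary so every sequence maps to a vector of the same length.
    vocabulary = ["".join(p) for p in product(alphabet, repeat=k)]

    # Count every overlapping window of length k.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

    # Windows with characters outside the alphabet (for example N) are not
    # in the vocabulary, so they are ignored (an assumption of this sketch).
    total = sum(counts[kmer] for kmer in vocabulary)
    if total == 0:
        return [0.0] * len(vocabulary)

    return [counts[kmer] / total for kmer in vocabulary]


if __name__ == "__main__":
    vector = kmer_frequencies("ACGTACGTNACGT", k=2)
    print(len(vector), round(sum(vector), 6))
```

Because the vector length depends only on `k` and the alphabet, sequences of different lengths yield features of identical shape without any alignment step, which is the property that makes this family of methods attractive for very large datasets.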