Bioinformatics Group, Center for Informatics Science, School of Information Technology and Computer Science, Nile University, Giza, Egypt.
Microbiology Unit, Belgian Nuclear Research Centre (SCK•CEN), Mol, Belgium.
BMC Bioinformatics. 2024 Mar 27;25(1):131. doi: 10.1186/s12859-024-05648-2.
The global spread of SARS-CoV-2, which originated in Wuhan, China, has had profound consequences for both health and the economy. Traditional alignment-based phylogenetic tree methods for tracking epidemic dynamics demand substantial computational power because the number of sequenced strains keeps growing. Consequently, there is a pressing need for an alignment-free approach to characterize these strains and monitor the dynamics of the various variants. In this work, we introduce a fast and simple tool named GenoSig, implemented in C++. The tool uses di- and trinucleotide frequency signatures to delineate the taxonomic lineages of SARS-CoV-2 by means of several machine learning (ML) and deep learning (DL) models. Our approach achieved a tenfold cross-validation accuracy of 87.88% (± 0.013) for the DL model and 86.37% (± 0.0009) for the Random Forest (RF) model, surpassing the other ML models. Validation on an additional, previously unseen dataset yielded comparable results. Despite the architectural differences between DL and RF, both models performed better on the later clades, specifically GRA, GRY, and GK, than on the earlier clades G and GH. For the continental origin of the virus, both DL and RF models performed worse than they did in predicting clades. However, both models achieved relatively higher accuracy for Europe, North America, and South America than for other continents, with DL outperforming RF. Both models consistently demonstrated a preference for cytosine and guanine over adenine and thymine in both the clade and continental analyses, for both di- and trinucleotide frequency signatures. Our findings suggest that GenoSig provides a straightforward approach to taxonomic, epidemiological, and biological questions, using a reductive method applicable not only to SARS-CoV-2 but also to similar research questions in an alignment-free context.
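To make the idea of a di- and trinucleotide frequency signature concrete, the following is a minimal illustrative sketch in Python. It is not GenoSig's C++ implementation, and the function names (`kmer_frequencies`, `signature`) as well as the handling of ambiguous bases are assumptions made for this example. It counts overlapping k-mers in a genome sequence and normalizes them into a fixed-length feature vector of 16 dinucleotide and 64 trinucleotide frequencies, which could then be passed to a classifier.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"


def kmer_frequencies(sequence, k):
    """Relative frequencies of all 4**k k-mers over A/C/G/T, in lexicographic order.

    Assumption for this sketch: windows containing any other character
    (e.g. N) are ignored rather than resolved.
    """
    seq = sequence.upper()
    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    # Normalize only over unambiguous k-mers.
    total = sum(counts[km] for km in kmers)
    if total == 0:
        return {km: 0.0 for km in kmers}
    return {km: counts[km] / total for km in kmers}


def signature(sequence):
    """Concatenate 16 dinucleotide and 64 trinucleotide frequencies (80 features)."""
    di = kmer_frequencies(sequence, 2)
    tri = kmer_frequencies(sequence, 3)
    return list(di.values()) + list(tri.values())


if __name__ == "__main__":
    features = signature("ACGTACGTNNACGT")
    print(len(features))
```

Because the vector length is fixed at 80 regardless of genome length, no alignment is needed before the sequences are compared or classified.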