Utilizing genomic signatures to gain insights into the dynamics of SARS-CoV-2 through Machine and Deep Learning techniques.

Affiliations

Bioinformatics Group, Center for Informatics Science, School of Information Technology and Computer Science, Nile University, Giza, Egypt.

Microbiology Unit, Belgian Nuclear Research Centre (SCK•CEN), Mol, Belgium.

Publication information

BMC Bioinformatics. 2024 Mar 27;25(1):131. doi: 10.1186/s12859-024-05648-2.

DOI:10.1186/s12859-024-05648-2

PMID:38539073

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10967124/

Abstract

The global spread of the SARS-CoV-2 pandemic, originating in Wuhan, China, has had profound consequences on both health and the economy. Traditional alignment-based phylogenetic tree methods for tracking epidemic dynamics demand substantial computational power due to the growing number of sequenced strains. Consequently, there is a pressing need for an alignment-free approach to characterize these strains and monitor the dynamics of various variants. In this work, we introduce a swift and straightforward tool named GenoSig, implemented in C++. The tool exploits di- and trinucleotide frequency signatures to delineate the taxonomic lineages of SARS-CoV-2 by employing diverse machine learning (ML) and deep learning (DL) models. Our approach achieved a tenfold cross-validation accuracy of 87.88% (± 0.013) for the DL model and 86.37% (± 0.0009) for the Random Forest (RF) model, surpassing the performance of other ML models. Validation using an additional unexposed dataset yielded comparable results. Despite variations in architectures between DL and RF, it was observed that later clades, specifically GRA, GRY, and GK, exhibited superior performance compared to earlier clades G and GH. As for the continental origin of the virus, both DL and RF models exhibited lower performance than in predicting clades. However, both models demonstrated relatively higher accuracy for Europe, North America, and South America compared to other continents, with DL outperforming RF. Both models consistently demonstrated a preference for cytosine and guanine over adenine and thymine in both clade and continental analyses, in both the di- and trinucleotide frequency signatures. Our findings suggest that GenoSig provides a straightforward approach to address taxonomic, epidemiological, and biological inquiries, utilizing a reductive method applicable not only to SARS-CoV-2 but also to similar research questions in an alignment-free context.
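To make the idea of an alignment-free di- and trinucleotide frequency signature concrete, the following is a minimal Python sketch. It is not GenoSig's implementation (the tool itself is written in C++ and its internals are not described here). The function names `kmer_signature` and `genome_signature`, the use of overlapping windows, and the choice to skip windows containing ambiguous bases such as `N` are all assumptions made for illustration.

```python
from itertools import product

ALPHABET = "ACGT"


def kmer_signature(sequence: str, k: int) -> dict[str, float]:
    """Relative frequency of every possible k-mer over overlapping windows.

    Hypothetical helper: windows containing characters outside A/C/G/T
    (for example N) are skipped, which is an assumption of this sketch.
    """
    seq = sequence.upper().replace("U", "T")
    # Start every one of the 4**k possible k-mers at zero so the
    # feature vector always has the same length and ordering.
    counts = {"".join(p): 0 for p in product(ALPHABET, repeat=k)}
    total = 0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:
            counts[kmer] += 1
            total += 1
    if total == 0:
        return {kmer: 0.0 for kmer in counts}
    return {kmer: n / total for kmer, n in counts.items()}


def genome_signature(sequence: str) -> list[float]:
    """Concatenate 16 dinucleotide and 64 trinucleotide frequencies."""
    di = kmer_signature(sequence, 2)
    tri = kmer_signature(sequence, 3)
    return [di[key] for key in sorted(di)] + [tri[key] for key in sorted(tri)]


if __name__ == "__main__":
    toy_genome = "ATGCGCGTANACGT"  # toy sequence, not real SARS-CoV-2 data
    features = genome_signature(toy_genome)
    print(len(features))  # 16 + 64 = 80 features per genome
```

A fixed-length vector of this kind can be fed to a classifier such as a Random Forest or a neural network without aligning the genomes first, which is the general approach the abstract describes.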

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e524/10967124/449139fc31e0/12859_2024_5648_Fig1_HTML.jpg
