Department of Computer Science, Systems and Communications, University of Milano-Bicocca, Milan 20125, Italy.
Gigascience. 2022 Dec 28;12. doi: 10.1093/gigascience/giac119.
Since the beginning of the coronavirus disease 2019 pandemic, there has been an explosion of sequencing of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), making it the most widely sequenced virus in history. Several databases and tools have been created to keep track of genome sequences and variants of the virus; most notably, the GISAID platform hosts millions of complete genome sequences and is expanding every day. A challenging task is the development of fast and accurate tools that can distinguish between the different SARS-CoV-2 variants and assign them to a clade.
In this article, we leverage the frequency chaos game representation (FCGR) and convolutional neural networks (CNNs) to develop an original method that learns how to classify genome sequences, which we implement in CouGaR-g, a tool for the clade assignment problem on SARS-CoV-2 sequences. On a testing subset of GISAID, CouGaR-g achieved a 96.29% overall accuracy, while a similar tool, Covidex, obtained a 77.12% overall accuracy. As far as we know, our method is the first to use deep learning and FCGR for intraspecies classification. Furthermore, by using feature importance methods, CouGaR-g makes it possible to identify k-mers that match SARS-CoV-2 marker variants.
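To make the FCGR idea concrete, the following is a minimal sketch of how a genome sequence can be turned into the 2^k x 2^k matrix of k-mer counts that a CNN could then consume as an image. It is not taken from CouGaR-g: the function names `kmer_to_cell` and `fcgr`, the corner layout assigned to A, C, G, and T, and the choice to skip k-mers containing ambiguous bases are all assumptions made for illustration.

```python
import numpy as np

# Assumed corner layout (conventions differ between implementations):
# each nucleotide contributes one bit to the column (x) and one to the row (y).
CORNER_BITS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def kmer_to_cell(kmer):
    """Map a k-mer to its (row, col) cell in the 2^k x 2^k FCGR grid."""
    row = col = 0
    # In the chaos game the last nucleotide selects the coarsest quadrant,
    # so it supplies the most significant bit: iterate from the end.
    for base in reversed(kmer):
        x_bit, y_bit = CORNER_BITS[base]
        col = (col << 1) | x_bit
        row = (row << 1) | y_bit
    return row, col


def fcgr(sequence, k):
    """Count every k-mer of `sequence` into a 2^k x 2^k integer matrix."""
    size = 2 ** k
    grid = np.zeros((size, size), dtype=np.int64)
    seq = sequence.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # Assumption: k-mers with ambiguous symbols (e.g. N) are skipped.
        if all(base in CORNER_BITS for base in kmer):
            row, col = kmer_to_cell(kmer)
            grid[row, col] += 1
    return grid


if __name__ == "__main__":
    matrix = fcgr("ACGTACGTNACGT", k=2)
    print(matrix.shape)  # one cell per possible 2-mer
    print(matrix.sum())  # number of counted (unambiguous) 2-mers
```

Because every k-mer maps to exactly one cell, a cell that a feature importance method highlights can be traced back to a specific k-mer, which is the link between the image-like input and the marker variants mentioned above.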
By combining FCGR and CNNs, we develop a method that achieves better accuracy than Covidex (which is based on random forest) for clade assignment of SARS-CoV-2 genome sequences, thanks in part to training on a much larger dataset, with comparable running times. The method implemented in CouGaR-g can detect k-mers that capture the biological information distinguishing the clades, known as marker variants.
The trained models can be tested online by providing a FASTA file (with 1 or multiple sequences) at https://huggingface.co/spaces/BIASLab/sars-cov-2-classification-fcgr. CouGaR-g is also available at https://github.com/AlgoLab/CouGaR-g under the GPL.