Accurate and fast clade assignment via deep learning and frequency chaos game representation.

Author information

Department of Computer Science, Systems and Communications, University of Milano-Bicocca, Milan 20125, Italy.

Publication information

Gigascience. 2022 Dec 28;12. doi: 10.1093/gigascience/giac119.

Abstract

BACKGROUND

Since the beginning of the coronavirus disease 2019 pandemic, there has been an explosion of sequencing of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), making it the most widely sequenced virus in history. Several databases and tools have been created to keep track of genome sequences and variants of the virus; most notably, the GISAID platform hosts millions of complete genome sequences and is expanding every day. A challenging task is the development of fast and accurate tools that can distinguish between the different SARS-CoV-2 variants and assign them to a clade.

RESULTS

In this article, we leverage the frequency chaos game representation (FCGR) and convolutional neural networks (CNNs) to develop an original method that learns how to classify genome sequences. We implement it in CouGaR-g, a tool for the clade assignment problem on SARS-CoV-2 sequences. On a testing subset of GISAID, CouGaR-g achieved 96.29% overall accuracy, while a similar tool, Covidex, obtained 77.12% overall accuracy. As far as we know, our method is the first to use deep learning and FCGR for intraspecies classification. Furthermore, by using feature importance methods, CouGaR-g makes it possible to identify k-mers that match SARS-CoV-2 marker variants.
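As a rough illustration of the representation named above, the following sketch builds an FCGR-style matrix by counting each k-mer of a sequence in one cell of a 2^k x 2^k grid. The corner assignment, the bit order, and the function names `kmer_to_cell` and `fcgr` are assumptions made for this example; the convention used by CouGaR-g may differ.

```python
# Minimal FCGR-style sketch (assumed convention, not necessarily CouGaR-g's).
# Each nucleotide contributes one bit to the column (x) and one to the row (y).
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def kmer_to_cell(kmer):
    """Map a k-mer to a (row, col) cell of a 2^k x 2^k grid."""
    row = col = 0
    for base in kmer:
        x, y = CORNERS[base]
        col = (col << 1) | x  # append the x bit of this base
        row = (row << 1) | y  # append the y bit of this base
    return row, col


def fcgr(sequence, k):
    """Count every k-mer of `sequence` into its grid cell.

    k-mers containing symbols other than A, C, G, T (for example N)
    are skipped.
    """
    size = 2 ** k
    grid = [[0] * size for _ in range(size)]
    seq = sequence.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if all(base in CORNERS for base in kmer):
            row, col = kmer_to_cell(kmer)
            grid[row][col] += 1
    return grid


if __name__ == "__main__":
    for line in fcgr("ACGTACGTNACGT", 2):
        print(line)
```

The resulting grid can be treated as a single-channel image, which is what makes it a natural input for a CNN.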

CONCLUSIONS

By combining FCGR and CNNs, we develop a method that achieves better accuracy than Covidex (which is based on random forest) for clade assignment of SARS-CoV-2 genome sequences, with comparable running times; the improvement is also due to our training on a much larger dataset. Our method, implemented in CouGaR-g, is able to detect k-mers that capture relevant biological information distinguishing the clades, known as marker variants.
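To show how a cell of an FCGR matrix can be traced back to a k-mer, here is a minimal sketch that inverts an assumed cell layout (one bit per base for the column and one for the row) and lists the k-mers with the highest scores. The layout, the `importance` matrix, and the names `cell_to_kmer` and `top_kmers` are hypothetical; the abstract does not specify which feature importance methods produce the scores.

```python
# Hypothetical layout: each base contributes an (x_bit, y_bit) pair,
# with the first base of the k-mer in the most significant bit.
BASE_AT = {(0, 0): "A", (0, 1): "C", (1, 1): "G", (1, 0): "T"}


def cell_to_kmer(row, col, k):
    """Recover the k-mer stored at (row, col) of a 2^k x 2^k grid."""
    bases = []
    for shift in range(k - 1, -1, -1):
        x = (col >> shift) & 1  # column bit for this position
        y = (row >> shift) & 1  # row bit for this position
        bases.append(BASE_AT[(x, y)])
    return "".join(bases)


def top_kmers(importance, k, n=3):
    """Return the n k-mers whose cells have the highest importance score.

    `importance` is an assumed 2^k x 2^k list of lists of floats, produced
    by any feature importance method applied to the classifier.
    """
    cells = [
        (score, row, col)
        for row, values in enumerate(importance)
        for col, score in enumerate(values)
    ]
    cells.sort(key=lambda item: -item[0])  # highest score first
    return [(cell_to_kmer(row, col, k), score) for score, row, col in cells[:n]]


if __name__ == "__main__":
    toy_scores = [[0.1, 0.9], [0.5, 0.2]]  # made-up values for k = 1
    print(top_kmers(toy_scores, k=1, n=2))
```

The k-mers returned this way could then be compared against known marker variants of each clade.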

AVAILABILITY

The trained models can be tested online by providing a FASTA file (with 1 or multiple sequences) at https://huggingface.co/spaces/BIASLab/sars-cov-2-classification-fcgr. CouGaR-g is also available at https://github.com/AlgoLab/CouGaR-g under the GPL.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3bec/9795481/f55ff582101c/giac119fig1.jpg
