Identifying viruses from metagenomic data using deep learning.

Author information

Ren Jie, Song Kai, Deng Chao, Ahlgren Nathan A, Fuhrman Jed A, Li Yi, Xie Xiaohui, Poplin Ryan, Sun Fengzhu

Affiliations

Quantitative and Computational Biology Program, University of Southern California, Los Angeles, CA 90089, USA.

School of Mathematics and Statistics, Qingdao University, Qingdao 266071, China.

Publication information

Quant Biol. 2020 Mar;8(1):64-77. doi: 10.1007/s40484-019-0187-4.

Abstract

BACKGROUND

The recent development of metagenomic sequencing makes it possible to massively sequence microbial genomes including viral genomes without the need for laboratory culture. Existing reference-based and gene homology-based methods are not efficient in identifying unknown viruses or short viral sequences from metagenomic data.

METHODS

Here we developed a reference-free and alignment-free machine learning method, DeepVirFinder, for identifying viral sequences in metagenomic data using deep learning.
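The abstract does not say how sequences are presented to the network. As an illustration only, one common input representation for alignment-free deep learning on DNA is one-hot encoding of each base, often applied to both the forward strand and its reverse complement. The sketch below assumes that representation; the function names `one_hot` and `reverse_complement` are hypothetical and are not taken from the DeepVirFinder code.

```python
# Illustrative sketch only: one-hot encoding is an ASSUMED input
# representation, and these helper names are hypothetical.
BASES = "ACGT"
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def one_hot(seq):
    """Encode a DNA string as a list of 4-element rows in A, C, G, T order.

    Any character outside ACGT (for example N) becomes an all-zero row.
    """
    return [[1 if base == b else 0 for b in BASES] for base in seq.upper()]


def reverse_complement(seq):
    """Return the reverse complement; unknown characters map to N."""
    return "".join(COMPLEMENT.get(base, "N") for base in reversed(seq.upper()))


if __name__ == "__main__":
    contig = "ACGTN"
    print(one_hot(contig))
    print(reverse_complement(contig))
```

Encoding both strands is a design choice that makes a prediction independent of which strand a contig happened to be assembled from; whether and how the published model does this should be checked against the paper itself.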

RESULTS

Trained on viral RefSeq sequences discovered before May 2015 and evaluated on those discovered after that date, DeepVirFinder outperformed the state-of-the-art method VirFinder at all contig lengths, achieving AUROC values of 0.93, 0.95, 0.97, and 0.98 for 300, 500, 1000, and 3000 bp sequences, respectively. Enlarging the training data with millions of additional purified viral sequences from metavirome samples further improved the accuracy of identifying under-represented virus groups. Applying DeepVirFinder to real human gut metagenomic samples, we identified 51,138 viral sequences belonging to 175 bins in patients with colorectal carcinoma (CRC). Ten bins were found to be associated with cancer status, suggesting that viruses may play important roles in CRC.
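For readers unfamiliar with the metric, AUROC is the probability that a randomly chosen positive example (here, a viral sequence) receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it by direct pairwise comparison on made-up toy scores; the function name `auroc` and the data are hypothetical and do not reproduce the paper's evaluation.

```python
# Illustrative sketch only: toy data, not the paper's results.
def auroc(scores, labels):
    """Pairwise (Mann-Whitney) estimate of the area under the ROC curve.

    labels: 1 for positive (e.g. viral), 0 for negative.
    Ties between a positive and a negative score count as 0.5.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    toy_scores = [0.9, 0.8, 0.35, 0.6, 0.2]
    toy_labels = [1, 1, 1, 0, 0]
    print(auroc(toy_scores, toy_labels))
```

The pairwise loop is quadratic in the number of examples, which is fine for illustration; large evaluations would normally use a rank-based computation or an established library routine instead.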

CONCLUSIONS

Powered by deep learning and high throughput sequencing metagenomic data, DeepVirFinder significantly improved the accuracy of viral identification and will assist the study of viruses in the era of metagenomics.
