Identifying viruses from metagenomic data using deep learning.

Authors

Ren Jie, Song Kai, Deng Chao, Ahlgren Nathan A, Fuhrman Jed A, Li Yi, Xie Xiaohui, Poplin Ryan, Sun Fengzhu

Affiliations

Quantitative and Computational Biology Program, University of Southern California, Los Angeles, CA 90089, USA.

School of Mathematics and Statistics, Qingdao University, Qingdao 266071, China.

Publication information

Quant Biol. 2020 Mar;8(1):64-77. doi: 10.1007/s40484-019-0187-4.

DOI: 10.1007/s40484-019-0187-4

PMID: 34084563

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8172088/

Abstract

BACKGROUND

The recent development of metagenomic sequencing makes it possible to massively sequence microbial genomes including viral genomes without the need for laboratory culture. Existing reference-based and gene homology-based methods are not efficient in identifying unknown viruses or short viral sequences from metagenomic data.

METHODS

Here we developed a reference-free and alignment-free machine learning method, DeepVirFinder, for identifying viral sequences in metagenomic data using deep learning.
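As an illustration of what alignment-free input can look like, the sketch below one-hot encodes a contig and its reverse complement so that a neural network could consume the raw nucleotides directly. The abstract does not specify the encoding, so the scheme, the handling of ambiguous bases, and the function names `one_hot` and `reverse_complement` are assumptions made for this example, not a description of DeepVirFinder's implementation.

```python
# Hypothetical helpers for illustration only; not DeepVirFinder's code.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
ONE_HOT = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}
UNKNOWN = (0, 0, 0, 0)  # assumption: ambiguous bases (e.g. N) become all zeros


def reverse_complement(seq):
    """Return the reverse complement; unrecognized bases map to N."""
    return "".join(COMPLEMENT.get(base, "N") for base in reversed(seq.upper()))


def one_hot(seq):
    """Encode a nucleotide string as a list of 4-element tuples (A, C, G, T)."""
    return [ONE_HOT.get(base, UNKNOWN) for base in seq.upper()]


contig = "ACGTN"
forward = one_hot(contig)
backward = one_hot(reverse_complement(contig))
```

Encoding both strands reflects the fact that a contig assembled from metagenomic reads may come from either strand; how the two encodings are combined is left open here.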

RESULTS

Trained on sequences from viral RefSeq discovered before May 2015 and evaluated on those discovered after that date, DeepVirFinder outperformed the state-of-the-art method VirFinder at all contig lengths, achieving AUROC values of 0.93, 0.95, 0.97, and 0.98 for 300, 500, 1000, and 3000 bp sequences, respectively. Enlarging the training data with millions of additional purified viral sequences from metavirome samples further improved the accuracy for identifying under-represented virus groups. Applying DeepVirFinder to real human gut metagenomic samples, we identified 51,138 viral sequences belonging to 175 bins in patients with colorectal carcinoma (CRC). Ten bins were found to be associated with cancer status, suggesting that viruses may play important roles in CRC.
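For readers unfamiliar with the AUROC metric reported above, it can be read as the probability that a randomly chosen viral sequence receives a higher score than a randomly chosen non-viral one. The following is a minimal sketch of that definition; the function name `auroc` and the toy labels and scores are made up for illustration and are not data from the study.

```python
def auroc(labels, scores):
    """AUROC via pairwise comparison: the fraction of (positive, negative)
    pairs in which the positive scores higher; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: 1 = viral, 0 = non-viral (hypothetical values).
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]
result = auroc(labels, scores)  # 8 of the 9 pairs are ranked correctly
```

The pairwise loop is quadratic in the number of sequences, which is fine for a demonstration; a rank-based computation would be preferable for large evaluation sets.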

CONCLUSIONS

Powered by deep learning and high-throughput metagenomic sequencing data, DeepVirFinder significantly improved the accuracy of viral identification and will assist the study of viruses in the era of metagenomics.
