Benchmarking informatics approaches for virus discovery: caution is needed when combining identification methods.

Affiliations

Department of Civil and Environmental Engineering, Case Western Reserve University, Cleveland, Ohio, USA.

Department of Microbiology, The Ohio State University, Columbus, Ohio, USA.

Publication information

mSystems. 2024 Mar 19;9(3):e0110523. doi: 10.1128/msystems.01105-23. Epub 2024 Feb 20.

DOI:10.1128/msystems.01105-23
PMID:38376167
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10949488/
Abstract

Understanding the ecological impacts of viruses on natural and engineered ecosystems relies on the accurate identification of viral sequences from community sequencing data. To maximize viral recovery from metagenomes, researchers frequently combine viral identification tools. However, the effectiveness of this strategy is unknown. Here, we benchmarked combinations of six widely used informatics tools for viral identification and analysis (VirSorter, VirSorter2, VIBRANT, DeepVirFinder, CheckV, and Kaiju), called "rulesets." Rulesets were tested against mock metagenomes composed of taxonomically diverse sequence types and diverse aquatic metagenomes to assess the effects of the degree of viral enrichment and habitat on tool performance. We found that six rulesets achieved equivalent accuracy [Matthews Correlation Coefficient (MCC) = 0.77, adjusted P ≥ 0.05]. Each contained VirSorter2, and five used our "tuning removal" rule designed to remove non-viral contamination. While DeepVirFinder, VIBRANT, and VirSorter were each found once in these high-accuracy rulesets, they were not found in combination with each other: combining tools does not lead to optimal performance. Our validation suggests that the MCC plateau at 0.77 is partly caused by inaccurate labeling within reference sequence databases. In aquatic metagenomes, our highest MCC ruleset identified more viral sequences in virus-enriched (44%-46%) than in cellular metagenomes (7%-19%). While improved algorithms may lead to more accurate viral identification tools, this should be done in tandem with careful curation of sequence databases. We recommend using the VirSorter2 ruleset and our empirically derived tuning removal rule. Our analysis provides insight into methods for viral identification and will enable more robust viral identification from metagenomic data sets.
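The accuracy metric named in the abstract, the Matthews Correlation Coefficient, can be computed from the four confusion-matrix counts of a viral/non-viral classification. The following is a minimal sketch of the standard formula, not the authors' pipeline; the function name and the example counts are hypothetical and chosen only for illustration.

```python
import math


def matthews_corrcoef(tp: int, fp: int, tn: int, fn: int) -> float:
    """Return the MCC for binary confusion-matrix counts.

    MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
    By convention this sketch returns 0.0 when the denominator is zero.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical counts (not from the paper): 100 truly viral contigs and
# 900 non-viral contigs, with 80 viral contigs recovered and 10 false calls.
score = matthews_corrcoef(tp=80, fp=10, tn=890, fn=20)
print(f"MCC = {score:.3f}")
```

MCC ranges from -1 (every call wrong) through 0 (no better than chance) to 1 (every call right), and it stays informative when viral sequences are a small minority of the contigs, which is the usual case in cellular metagenomes.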

IMPORTANCE

The identification of viruses from environmental metagenomes using informatics tools has offered critical insights in microbial ecology. However, it remains difficult for researchers to know which tools optimize viral recovery for their specific study. In an attempt to recover more viruses, studies are increasingly combining the outputs from multiple tools without validating this approach. After benchmarking combinations of six viral identification tools against mock metagenomes and environmental samples, we found that these tools should only be combined cautiously. Two to four tool combinations maximized viral recovery and minimized non-viral contamination compared with either the single-tool or the five- to six-tool ones. By providing a rigorous overview of the behavior of viral identification strategies and a pipeline to replicate our process, our findings guide the use of existing viral identification tools and offer a blueprint for feature engineering of new tools that will lead to higher-confidence viral discovery in microbiome studies.
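Why combining tools calls for caution can be illustrated with a toy benchmark. The sketch below assumes the simplest possible combination rule, a plain union of each tool's viral calls, which is an assumption for illustration and not the paper's ruleset definition (the paper's rulesets also include removal rules). The tool names, contig IDs, and calls are all invented.

```python
import math
from itertools import combinations

# Hypothetical mock metagenome: 10 contigs, 4 of them truly viral.
UNIVERSE = {f"c{i}" for i in range(1, 11)}
TRUTH = {"c1", "c2", "c3", "c4"}

# Hypothetical per-tool viral calls (invented data, not real tool output).
CALLS = {
    "tool_a": {"c1", "c2", "c3"},
    "tool_b": {"c1", "c2", "c5", "c6"},
    "tool_c": {"c3", "c4", "c7", "c8"},
}


def mcc(predicted: set, truth: set, universe: set) -> float:
    """Matthews Correlation Coefficient for a set of predicted positives."""
    tp = len(predicted & truth)
    fp = len(predicted - truth)
    fn = len(truth - predicted)
    tn = len(universe) - tp - fp - fn
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def score_combinations(calls: dict, truth: set, universe: set) -> dict:
    """Score every non-empty tool combination, merging calls by union."""
    results = {}
    for k in range(1, len(calls) + 1):
        for combo in combinations(sorted(calls), k):
            predicted = set().union(*(calls[t] for t in combo))
            results[combo] = mcc(predicted, truth, universe)
    return results


results = score_combinations(CALLS, TRUTH, UNIVERSE)
best = max(results, key=results.get)
for combo, value in sorted(results.items(), key=lambda kv: -kv[1]):
    print(f"{' + '.join(combo):<28} MCC = {value:.3f}")
```

In this invented data set, adding a second tool to `tool_a` recovers at most one more viral contig while admitting two non-viral ones, so every union scores a lower MCC than `tool_a` alone. Real benchmarks depend on the tools, thresholds, and reference labels used, so the toy outcome should not be read as a result about any actual tool.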

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0b25/10949488/ee477dcf102b/msystems.01105-23.f008.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0b25/10949488/70fc37d93b89/msystems.01105-23.f001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0b25/10949488/a5d0877bb007/msystems.01105-23.f002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0b25/10949488/bb6bab7546be/msystems.01105-23.f003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0b25/10949488/9d9c0f312f58/msystems.01105-23.f004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0b25/10949488/ab637754f6b6/msystems.01105-23.f005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0b25/10949488/d4e6e55d2a16/msystems.01105-23.f006.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0b25/10949488/d03da6ac6bd5/msystems.01105-23.f007.jpg
