MArVD2: a machine learning enhanced tool to discriminate between archaeal and bacterial viruses in viral datasets.

Author information

Vik Dean, Bolduc Benjamin, Roux Simon, Sun Christine L, Pratama Akbar Adjie, Krupovic Mart, Sullivan Matthew B

Affiliations

Department of Microbiology, The Ohio State University, Columbus, OH, 43210, USA.

Center of Microbiome Science, The Ohio State University, Columbus, OH, USA.

Publication information

ISME Commun. 2023 Aug 24;3(1):87. doi: 10.1038/s43705-023-00295-9.

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10449787/

Abstract

Our knowledge of viral sequence space has exploded with advancing sequencing technologies and large-scale sampling and analytical efforts. Though archaea are important and abundant prokaryotes in many systems, our knowledge of archaeal viruses outside of extreme environments is limited. This largely stems from the lack of a robust, high-throughput, and systematic way to distinguish between bacterial and archaeal viruses in datasets of curated viruses. Here we upgrade our prior text-based tool (MArVD) via training and testing a random forest machine learning algorithm against a newly curated dataset of archaeal viruses. After optimization, MArVD2 presented a significant improvement over its predecessor in terms of scalability, usability, and flexibility, and will allow user-defined custom training datasets as archaeal virus discovery progresses. Benchmarking showed that a model trained with viral sequences from the hypersaline, marine, and hot spring environments correctly classified 85% of the archaeal viruses with a false detection rate below 2% using a random forest prediction threshold of 80% in a separate benchmarking dataset from the same habitats.
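
The benchmarking figures above depend on where the prediction threshold is set. The following is a minimal sketch, not MArVD2's own code, of how recall and a false detection rate could be computed once a classifier has assigned each sequence a probability of being archaeal. The function name `evaluate_at_threshold`, the toy probabilities, and the definition of false detection rate as false positives divided by all archaeal calls are assumptions made for illustration; the abstract does not define the metric.

```python
from typing import Sequence, Tuple


def evaluate_at_threshold(
    archaeal_probs: Sequence[float],
    is_archaeal: Sequence[bool],
    threshold: float = 0.80,
) -> Tuple[float, float]:
    """Return (recall, false_detection_rate) for calls made at `threshold`.

    A sequence is called archaeal when its predicted probability is at or
    above the threshold. False detection rate is ASSUMED here to mean
    false positives / (true positives + false positives).
    """
    if len(archaeal_probs) != len(is_archaeal):
        raise ValueError("probabilities and labels must have the same length")

    called = [p >= threshold for p in archaeal_probs]
    tp = sum(1 for c, t in zip(called, is_archaeal) if c and t)
    fp = sum(1 for c, t in zip(called, is_archaeal) if c and not t)
    fn = sum(1 for c, t in zip(called, is_archaeal) if not c and t)

    recall = tp / (tp + fn) if (tp + fn) else 0.0
    false_detection_rate = fp / (tp + fp) if (tp + fp) else 0.0
    return recall, false_detection_rate


# Hypothetical toy data: four archaeal viruses followed by four bacterial viruses.
probs = [0.95, 0.85, 0.60, 0.90, 0.10, 0.30, 0.82, 0.05]
truth = [True, True, True, True, False, False, False, False]

recall, fdr = evaluate_at_threshold(probs, truth, threshold=0.80)
print(f"recall={recall:.2f}, false detection rate={fdr:.2f}")
```

Raising the threshold generally trades recall for fewer false detections, which is why the abstract reports the two figures together with the 80% threshold.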

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/fda0/10449787/a85e325f4267/43705_2023_295_Fig1_HTML.jpg
