MArVD2: a machine learning enhanced tool to discriminate between archaeal and bacterial viruses in viral datasets.

Author information

Vik Dean, Bolduc Benjamin, Roux Simon, Sun Christine L, Pratama Akbar Adjie, Krupovic Mart, Sullivan Matthew B

Affiliations

Department of Microbiology, The Ohio State University, Columbus, OH, 43210, USA.

Center of Microbiome Science, The Ohio State University, Columbus, OH, USA.

Publication information

ISME Commun. 2023 Aug 24;3(1):87. doi: 10.1038/s43705-023-00295-9.

Abstract

Our knowledge of viral sequence space has exploded with advancing sequencing technologies and large-scale sampling and analytical efforts. Though archaea are important and abundant prokaryotes in many systems, our knowledge of archaeal viruses outside of extreme environments is limited. This largely stems from the lack of a robust, high-throughput, and systematic way to distinguish between bacterial and archaeal viruses in datasets of curated viruses. Here we upgrade our prior text-based tool (MArVD) via training and testing a random forest machine learning algorithm against a newly curated dataset of archaeal viruses. After optimization, MArVD2 presented a significant improvement over its predecessor in terms of scalability, usability, and flexibility, and will allow user-defined custom training datasets as archaeal virus discovery progresses. Benchmarking showed that a model trained with viral sequences from the hypersaline, marine, and hot spring environments correctly classified 85% of the archaeal viruses with a false detection rate below 2% using a random forest prediction threshold of 80% in a separate benchmarking dataset from the same habitats.
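To make the benchmarking figures concrete, the sketch below shows how a probability threshold of 80% turns per-sequence model scores into archaeal/non-archaeal calls, and how the fraction of archaeal viruses recovered and a false detection rate can be computed from those calls. This is not MArVD2's code: the function names, the toy scores, and the labels are all hypothetical, and "false detection rate" is interpreted here as false positives divided by all positive calls, which is an assumption because the abstract does not define the term.

```python
from typing import List, Tuple


def classify(probabilities: List[float], threshold: float = 0.80) -> List[bool]:
    """Call a sequence archaeal when its predicted probability meets the threshold."""
    return [p >= threshold for p in probabilities]


def recall_and_fdr(calls: List[bool], truth: List[bool]) -> Tuple[float, float]:
    """Return (recall, false detection rate) for boolean calls against boolean truth.

    Assumption: false detection rate = FP / (TP + FP).
    """
    tp = sum(c and t for c, t in zip(calls, truth))
    fp = sum(c and not t for c, t in zip(calls, truth))
    fn = sum((not c) and t for c, t in zip(calls, truth))
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    fdr = fp / (tp + fp) if (tp + fp) else 0.0
    return recall, fdr


# Made-up scores standing in for a random forest's predicted probability
# that each viral sequence is archaeal (illustrative values only).
probs = [0.95, 0.85, 0.60, 0.90, 0.10, 0.30, 0.82, 0.05]
# Made-up ground truth: True = archaeal virus, False = bacterial virus.
truth = [True, True, True, True, False, False, False, False]

calls = classify(probs, threshold=0.80)
recall, fdr = recall_and_fdr(calls, truth)
print(f"recall={recall:.2f}, false detection rate={fdr:.2f}")
```

Raising the threshold generally trades recall for fewer false detections, which is why the reported 85% and below-2% figures are tied to the 80% threshold.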

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/fda0/10449787/a85e325f4267/43705_2023_295_Fig1_HTML.jpg
