Vik Dean, Bolduc Benjamin, Roux Simon, Sun Christine L, Pratama Akbar Adjie, Krupovic Mart, Sullivan Matthew B
Department of Microbiology, The Ohio State University, Columbus, OH, 43210, USA.
Center of Microbiome Science, The Ohio State University, Columbus, OH, USA.
ISME Commun. 2023 Aug 24;3(1):87. doi: 10.1038/s43705-023-00295-9.
Our knowledge of viral sequence space has exploded with advancing sequencing technologies and large-scale sampling and analytical efforts. Though archaea are important and abundant prokaryotes in many systems, our knowledge of archaeal viruses outside of extreme environments is limited. This largely stems from the lack of a robust, high-throughput, and systematic way to distinguish between bacterial and archaeal viruses in datasets of curated viruses. Here we upgrade our prior text-based tool (MArVD) via training and testing a random forest machine learning algorithm against a newly curated dataset of archaeal viruses. After optimization, MArVD2 presented a significant improvement over its predecessor in terms of scalability, usability, and flexibility, and will allow user-defined custom training datasets as archaeal virus discovery progresses. Benchmarking showed that a model trained with viral sequences from the hypersaline, marine, and hot spring environments correctly classified 85% of the archaeal viruses with a false detection rate below 2% using a random forest prediction threshold of 80% in a separate benchmarking dataset from the same habitats.
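The benchmarking figures above depend on where the random forest prediction threshold is set. The sketch below is only an illustration of how a threshold turns per-sequence scores into a share of archaeal viruses recovered and a false detection rate. It is not MArVD2 code. The function name, the toy scores, and the labels are all hypothetical. It also assumes that "false detection rate" means the fraction of sequences called archaeal that are in fact not archaeal, which the abstract does not define.

```python
from typing import List, Tuple


def evaluate_at_threshold(
    probs: List[float], is_archaeal: List[bool], threshold: float = 0.80
) -> Tuple[float, float]:
    """Return (recall, false_detection_rate) for calls made at `threshold`.

    Assumption: false detection rate = FP / (TP + FP), i.e. the share of
    sequences called archaeal that are not archaeal viruses.
    """
    tp = fp = fn = 0
    for p, truth in zip(probs, is_archaeal):
        called = p >= threshold  # call "archaeal virus" at or above the threshold
        if called and truth:
            tp += 1
        elif called and not truth:
            fp += 1
        elif not called and truth:
            fn += 1
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    false_detection_rate = fp / (tp + fp) if (tp + fp) else 0.0
    return recall, false_detection_rate


# Toy data, invented for illustration only (not from the study).
toy_probs = [0.95, 0.88, 0.81, 0.60, 0.92, 0.30, 0.10, 0.85]
toy_labels = [True, True, True, True, False, False, False, True]

recall, fdr = evaluate_at_threshold(toy_probs, toy_labels, threshold=0.80)
print(f"recall={recall:.2f} false_detection_rate={fdr:.2f}")
```

Raising the threshold generally lowers the false detection rate at the cost of recovering fewer true archaeal viruses, which is the trade-off behind reporting both numbers at a fixed 80% threshold.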