Garcia Benjamin J, Simha Ramanuja, Garvin Michael, Furches Anna, Jones Piet, Gazolla Joao G F M, Hyatt P Doug, Schadt Christopher W, Pelletier Dale, Jacobson Daniel
Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, TN, United States.
Department of Biological Engineering, Massachusetts Institute of Technology, Cambridge, MA, United States.
Comput Struct Biotechnol J. 2021 Oct 25;19:5911-5919. doi: 10.1016/j.csbj.2021.10.029. eCollection 2021.
Viruses are underrepresented taxa in the study and identification of microbiome constituents; however, they play an essential role in health, microbiome regulation, and transfer of genetic material. Only a few thousand viruses have been isolated, sequenced, and assigned a taxonomy, which limits the ability to identify and quantify viruses in the microbiome. Additionally, the vast diversity of viruses represents a challenge for classification, not only in constructing a viral taxonomy, but also in identifying similarities between a virus's genotype and its phenotype. However, the diversity of viral sequences can be leveraged to classify their sequences in metagenomic and metatranscriptomic samples, even if they do not have a taxonomy. To identify and quantify viruses in transcriptomic and genomic samples, we developed a dynamic programming algorithm for creating a classification tree from 715,672 metagenome viruses. To create the classification tree, we clustered proportional similarity scores generated from the k-mer profiles of each of the metagenome viruses to create a database of metagenomic viruses. The resulting database of metagenomic viruses is compatible with Kraken2 and is available at https://www.osti.gov/biblio/1615774. We then integrated the viral classification database with databases created with genomes from NCBI for use with ParaKraken (a parallelized version of Kraken provided in Supplemental Zip 1), a metagenomic/transcriptomic classifier. To illustrate the breadth of utility of our approach for classifying metagenome viruses, we analyzed data from a plant metagenome study, identifying genotype- and compartment-specific differences between two genotypes in three different compartments. We also identified a significant increase in abundance of eight viral sequences in postmortem brains in a human metatranscriptome study comparing autism spectrum disorder patients and controls.
We also show the potential accuracy of viral classification by using both the JGI and NCBI viral databases to identify the uniqueness of viral sequences. Finally, we validate the accuracy of viral classification with NCBI databases containing viruses with taxonomy to identify pathogenic viruses in known COVID-19 and cassava brown streak virus infection samples. Our method represents a necessary first step in better understanding the role of viruses in the microbiome by allowing for a more complete identification of sequences without taxonomy. Better classification of viruses will improve the identification of associations between viruses and their hosts as well as between viruses and other microbiome members. Despite the lack of taxonomy, this database of metagenomic viruses can be used with any tool that utilizes a taxonomy, such as Kraken, for accurate classification of viruses.
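As an illustration of the kind of score the abstract refers to, the following minimal sketch computes k-mer profiles and a proportional similarity between them. It assumes the common definition of proportional similarity (the sum, over all k-mers, of the smaller of the two relative frequencies); the abstract does not give the exact formula, the value of k, or the clustering procedure used, so the function names, the default `k=4`, and the toy sequences are all hypothetical and this should not be read as the authors' implementation.

```python
from collections import Counter
from itertools import combinations

VALID_BASES = set("ACGT")


def kmer_profile(seq, k=4):
    """Return the relative frequency of each k-mer in one sequence.

    k-mers containing characters other than A, C, G, T are skipped.
    (k=4 is an arbitrary illustrative default, not a value from the paper.)
    """
    seq = seq.upper()
    counts = Counter(
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= VALID_BASES
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: n / total for kmer, n in counts.items()}


def proportional_similarity(p, q):
    """Sum of the element-wise minima of two frequency profiles.

    Ranges from 0 (no shared k-mers) to 1 (identical profiles).
    """
    return sum(min(freq, q[kmer]) for kmer, freq in p.items() if kmer in q)


def pairwise_similarity(seqs, k=4):
    """Proportional similarity for every unordered pair of named sequences."""
    profiles = {name: kmer_profile(s, k) for name, s in seqs.items()}
    return {
        (a, b): proportional_similarity(profiles[a], profiles[b])
        for a, b in combinations(sorted(profiles), 2)
    }


if __name__ == "__main__":
    # Toy sequences, made up for illustration only.
    toy = {
        "virus_a": "ACGTACGTACGTTTGA",
        "virus_b": "ACGTACGTACGAATGA",
        "virus_c": "GGGGCCCCGGGGCCCC",
    }
    for pair, score in pairwise_similarity(toy, k=3).items():
        print(pair, round(score, 3))
```

A matrix of such pairwise scores is the sort of input a clustering step could consume to group similar sequences under shared nodes of a classification tree. Computing all pairs directly is quadratic in the number of sequences, so a collection of 715,672 viruses would need a more scalable strategy than this sketch.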