Department of Biological Sciences, Virginia Tech, Blacksburg, VA 24061, USA.
Viruses. 2021 Jan 20;13(2):150. doi: 10.3390/v13020150.
Giant viruses are widespread in the biosphere and play important roles in biogeochemical cycling and host genome evolution. Also known as nucleo-cytoplasmic large DNA viruses (NCLDVs), these eukaryotic viruses harbor the largest and most complex viral genomes known. Studies have shown that NCLDVs are frequently abundant in metagenomic datasets, and that sequences derived from these viruses can also be found endogenized in diverse eukaryotic genomes. The accurate detection of sequences derived from NCLDVs is therefore of great importance, but this task is challenging owing to both the high level of sequence divergence between NCLDV families and the extraordinarily high diversity of genes encoded in their genomes, including some encoding metabolic or translation-related functions that are typically found only in cellular lineages. Here, we present ViralRecall, a bioinformatic tool for the identification of NCLDV signatures in 'omic data. This tool leverages a library of giant virus orthologous groups (GVOGs) to identify sequences that bear signatures of NCLDVs. We demonstrate that this tool can effectively identify NCLDV sequences with high sensitivity and specificity. Moreover, we show that it is useful both for removing contaminating sequences from metagenome-assembled viral genomes and for identifying eukaryotic genomic loci derived from NCLDVs. ViralRecall is written in Python 3.5 and is freely available on GitHub: https://github.com/faylward/viralrecall.