Molecular and Computational Biology Program, University of Southern California, 1050 Childs Way, Los Angeles, CA, 90089, USA.
Department of Biological Sciences, University of Southern California, 3616 Trousdale Pkwy, Los Angeles, CA, 90089, USA.
Microbiome. 2017 Jul 6;5(1):69. doi: 10.1186/s40168-017-0283-5.
BACKGROUND: Identifying viral sequences in mixed metagenomes containing both viral and host contigs is a critical first step in analyzing the viral component of samples. Current tools for distinguishing prokaryotic virus and host contigs primarily use gene-based similarity approaches. Such approaches can significantly limit results, especially for short contigs that have few predicted proteins or lack proteins with similarity to previously known viruses.
METHODS: We have developed VirFinder, the first k-mer frequency based, machine learning method for virus contig identification that entirely avoids gene-based similarity searches. VirFinder instead identifies viral sequences based on our empirical observation that viruses and hosts have discernibly different k-mer signatures. VirFinder's performance in correctly identifying viral sequences was tested by training its machine learning model on sequences from host and viral genomes sequenced before 1 January 2014 and evaluating on sequences obtained after 1 January 2014.
RESULTS: VirFinder had significantly better rates of identifying true viral contigs (true positive rates (TPRs)) than VirSorter, the current state-of-the-art gene-based virus classification tool, when evaluated with either contigs subsampled from complete genomes or assembled from a simulated human gut metagenome. For example, for contigs subsampled from complete genomes, VirFinder had 78-, 2.4-, and 1.8-fold higher TPRs than VirSorter for 1, 3, and 5 kb contigs, respectively, at the same false positive rates as VirSorter (0, 0.003, and 0.006, respectively); thus, VirFinder works considerably better for small contigs than VirSorter. VirFinder furthermore identified several recently sequenced virus genomes (after 1 January 2014) that VirSorter did not and that have no nucleotide similarity to previously sequenced viruses, demonstrating VirFinder's potential advantage in identifying novel viral sequences.
Application of VirFinder to a set of human gut metagenomes from healthy and liver cirrhosis patients revealed higher viral diversity in healthy individuals than in cirrhosis patients. We also identified contig bins containing crAssphage-like contigs with higher abundance in healthy individuals and a putative Veillonella genus prophage associated with cirrhosis patients.
CONCLUSIONS: This innovative k-mer based tool complements gene-based approaches and will significantly improve prokaryotic viral sequence identification, especially for metagenomic-based studies of viral ecology.
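The k-mer signature idea described under METHODS can be illustrated with a minimal sketch of the feature-extraction step. This is not VirFinder's implementation: the abstract does not state the value of k, how ambiguous bases are handled, or which classifier is used, so the function name `kmer_frequencies`, the default `k=4`, and the choice to skip windows containing non-ACGT characters are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product
from typing import Dict

ALPHABET = "ACGT"


def kmer_frequencies(seq: str, k: int = 4) -> Dict[str, float]:
    """Return the normalized frequency of every possible DNA k-mer in seq.

    Hypothetical helper for illustration only; k=4 is an assumed default,
    not a value taken from the abstract.
    """
    seq = seq.upper()
    counts: Counter = Counter()
    # Slide a window of width k across the contig.
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # Assumption: windows with ambiguous bases (e.g. N) are skipped.
        if all(base in ALPHABET for base in kmer):
            counts[kmer] += 1
    total = sum(counts.values())
    # Emit all 4**k k-mers in a fixed order so every contig maps to a
    # feature vector of the same length, whatever the contig length.
    return {
        "".join(p): (counts["".join(p)] / total if total else 0.0)
        for p in product(ALPHABET, repeat=k)
    }


if __name__ == "__main__":
    features = kmer_frequencies("ACGTACGTTTGACCA", k=2)
    print(len(features), "features")
```

Because the frequencies are normalized by the number of valid windows, contigs of different lengths yield comparable vectors, which could then be passed to any standard classifier trained on labeled viral and host sequences.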