School of Public Health and Community Medicine, University of New South Wales, Randwick, NSW 2052, Australia.
Optimisation and Pattern Recognition Research Group, Department of Mechanical Engineering, Melbourne School of Engineering, The University of Melbourne, Parkville, VIC 3010, Australia.
BMC Bioinformatics. 2019 Feb 4;19(Suppl 13):377. doi: 10.1186/s12859-018-2398-5.
Estimating the parameters that describe the ecology of viruses, particularly those that are novel, can be made possible using metagenomic approaches. However, the best-performing existing methods require databases to first estimate an average genome length of a viral community before being able to estimate other parameters, such as viral richness. Although this approach has been widely used, it can adversely skew results since the majority of viruses are yet to be catalogued in databases.
In this paper, we present ENVirT, a method for estimating the richness of novel viral mixtures, and for the first time we also show that it is possible to simultaneously estimate the average genome length without a priori information. This is shown to be a significant improvement over database-dependent methods, since we can now robustly analyze samples that may include novel viral types under-represented in current databases. We demonstrate that the viral richness estimates produced by ENVirT are several orders of magnitude more accurate than the estimates produced by the existing methods PHACCS and CatchAll when benchmarked against simulated data. We repeated the analysis of 20 metavirome samples using ENVirT, which produced results in close agreement with complementary in vitro analyses.
These insights were previously not captured by existing computational methods. As such, ENVirT is shown to be an essential tool for enhancing our understanding of novel viral populations.