Department of Electrical Engineering, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong, China SAR.
Brief Bioinform. 2022 Mar 10;23(2). doi: 10.1093/bib/bbac011.
With advances in library construction protocols and next-generation sequencing technologies, viral metagenomic sequencing has become the major source for novel virus discovery. Taxonomic classification of metagenomic data is an important means of characterizing the viral composition of the underlying samples. However, RNA viruses are abundant and highly diverse, which reduces the sensitivity of comparison-based classification methods. To improve the sensitivity of read-level taxonomic classification, we developed RdRpBin, a read classification tool based on the RNA-dependent RNA polymerase (RdRp) gene. It combines an alignment-based strategy with machine learning models in order to fully exploit the sequence properties of RdRp. We tested our method and compared its performance with state-of-the-art tools on simulated and real sequencing data. RdRpBin competes favorably with all of them. In particular, when the query RNA viruses share low sequence similarity with the known viruses ($\sim 0.4$), our tool still maintains a higher F-score than the state-of-the-art tools. The experimental results on real data also showed that RdRpBin can classify more RNA viral reads with a relatively low false-positive rate. Thus, RdRpBin can be utilized to classify novel and diverged RNA viruses.