Rosen Gail L, Lim Tze Yee
Department of Electrical and Computer Engineering, Drexel University, Philadelphia, PA, USA.
BMC Res Notes. 2012 Jan 31;5:81. doi: 10.1186/1756-0500-5-81.
Classifying the fungal and viral content of a sample is an important component of analyzing microbial communities in environmental media. A method that can classify any DNA fragment from these organisms is therefore needed.
We update the naïve Bayes classification (NBC) tool to classify reads originating from viral and fungal organisms. NBC classifies a fungal dataset similarly to the Basic Local Alignment Search Tool (BLAST) and the Ribosomal Database Project (RDP) classifier. We also show NBC's similarities and differences to RDP on a fungal large subunit (LSU) ribosomal DNA dataset. For viruses in the training database, strain-level classification accuracy is 98%, while for reads originating from sequences not in the database, the order-level accuracy is 78%, where order indicates the taxonomic level in the tree of life.
In addition to being competitive with other available classifiers, NBC has the potential to handle reads originating from any location in the genome. We recommend using the Bacteria/Archaea, Fungal, and Virus databases separately due to algorithmic biases towards long genomes. The tool is publicly available at: http://nbc.ece.drexel.edu.
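
To make the idea of naïve Bayes read classification concrete, the following is a minimal sketch, not the NBC tool's own code. It assumes each reference genome is modeled by its k-mer frequencies and that a read is assigned to the genome with the highest summed log-probability over the read's k-mers. The function names (`kmers`, `train`, `log_score`, `classify`), the add-one smoothing, and the short k used in the toy call are illustrative assumptions.

```python
import math
from collections import Counter


def kmers(seq, k):
    """Return all overlapping k-mers of a sequence (upper-cased)."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def train(genomes, k):
    """Build one k-mer count model per genome.

    genomes: dict mapping a (hypothetical) genome name to its sequence.
    Returns a dict mapping name -> (k-mer counts, total k-mer count).
    """
    models = {}
    for name, seq in genomes.items():
        counts = Counter(kmers(seq, k))
        models[name] = (counts, sum(counts.values()))
    return models


def log_score(read, model, k):
    """Sum of log P(k-mer | genome) over the read's k-mers.

    Add-one smoothing over the 4**k possible DNA k-mers is an assumption
    made here so that unseen k-mers do not yield log(0).
    """
    counts, total = model
    vocab = 4 ** k
    return sum(
        math.log((counts[w] + 1) / (total + vocab)) for w in kmers(read, k)
    )


def classify(read, models, k):
    """Return the name of the genome with the highest log score."""
    return max(models, key=lambda name: log_score(read, models[name], k))


# Toy data, invented purely for illustration.
toy_genomes = {
    "genome_A": "ACGTACGTACGTACGTACGT",
    "genome_B": "TTTTGGGGCCCCAAAATTTTGGGG",
}
toy_models = train(toy_genomes, k=3)
best = classify("ACGTACGT", toy_models, k=3)
```

Because every k-mer in the read contributes independently to the score, a sketch like this can in principle score a read from any region of a genome, provided that region is represented in the training sequences.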