Using the RDP classifier to predict taxonomic novelty and reduce the search space for finding novel organisms.

Affiliation

School of Biomedical Engineering, Science, and Health Systems, Drexel University, Philadelphia, Pennsylvania, United States of America.

Publication information

PLoS One. 2012;7(3):e32491. doi: 10.1371/journal.pone.0032491. Epub 2012 Mar 5.

Abstract

BACKGROUND

Currently, the naïve Bayesian classifier provided by the Ribosomal Database Project (RDP) is one of the most widely used tools to classify 16S rRNA sequences, mainly collected from environmental samples. We show that RDP has 97+% assignment accuracy and is fast for 250 bp and longer reads when the read originates from a taxon known to the database. Because most environmental samples will contain organisms from taxa whose 16S rRNA genes have not been previously sequenced, we aim to benchmark how well the RDP classifier and other competing methods can discriminate these novel taxa from known taxa.

PRINCIPAL FINDINGS

Because each fragment is assigned a score (containing likelihood or confidence information, such as the bootstrap score in the RDP classifier), we "train" a threshold to discriminate between novel and known organisms and observe its performance on a test set. The threshold that we determine tends to be conservative (low sensitivity but high specificity) for naïve Bayesian methods. Nonetheless, our method performs better with the RDP classifier than with the other methods tested, as measured by the F-measure and the area under the receiver operating characteristic (ROC) curve on the test set. By constraining the database to well-represented genera, sensitivity improves by 3-15%. Finally, we show that the detector is a good predictor of abundant novel taxa (especially at finer levels of taxonomy, where novelty is more likely to be present).
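The threshold training and evaluation described above can be sketched as follows. This is a minimal illustration and not the authors' pipeline. It assumes that reads with a bootstrap score at or below a cutoff are flagged as novel, and that the cutoff is chosen to maximize the F-measure on the training set (the abstract reports the F-measure as an evaluation metric but does not state the training criterion). All function names are hypothetical and the scores are invented toy values, not data from the paper.

```python
from typing import List, Tuple


def confusion(scores: List[float], is_novel: List[bool], cutoff: float) -> Tuple[int, int, int, int]:
    """Return (tp, fp, fn, tn) when reads scoring <= cutoff are called novel."""
    tp = fp = fn = tn = 0
    for score, novel in zip(scores, is_novel):
        called_novel = score <= cutoff  # assumed convention: low confidence => novel
        if called_novel and novel:
            tp += 1
        elif called_novel:
            fp += 1
        elif novel:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def f_measure(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall; 0.0 when there are no true positives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def train_cutoff(scores: List[float], is_novel: List[bool]) -> float:
    """Pick the observed score that maximizes the F-measure (assumed criterion)."""
    best_cutoff, best_f = 0.0, -1.0
    for cutoff in sorted(set(scores)):
        tp, fp, fn, _ = confusion(scores, is_novel, cutoff)
        f = f_measure(tp, fp, fn)
        if f > best_f:  # strict '>' keeps the lowest cutoff on ties
            best_cutoff, best_f = cutoff, f
    return best_cutoff


def roc_auc(scores: List[float], is_novel: List[bool]) -> float:
    """Rank-based AUC: chance that a novel read scores below a known read (ties count half)."""
    novel = [s for s, n in zip(scores, is_novel) if n]
    known = [s for s, n in zip(scores, is_novel) if not n]
    wins = sum(1.0 if n < k else 0.5 if n == k else 0.0 for n in novel for k in known)
    return wins / (len(novel) * len(known))


def sensitivity_specificity(scores: List[float], is_novel: List[bool], cutoff: float) -> Tuple[float, float]:
    """Sensitivity and specificity of the novelty call at a fixed cutoff."""
    tp, fp, fn, tn = confusion(scores, is_novel, cutoff)
    return tp / (tp + fn), tn / (tn + fp)


# Toy bootstrap scores (invented for illustration only).
train_scores = [0.35, 0.50, 0.62, 0.80, 0.85, 0.90, 0.95, 0.99, 1.00]
train_novel = [True, True, True, False, True, False, False, False, False]
test_scores = [0.40, 0.70, 0.92, 0.88, 0.97, 1.00]
test_novel = [True, True, True, False, False, False]

cutoff = train_cutoff(train_scores, train_novel)
train_auc = roc_auc(train_scores, train_novel)
sensitivity, specificity = sensitivity_specificity(test_scores, test_novel, cutoff)
print(cutoff, train_auc, sensitivity, specificity)
```

On this toy test set the learned cutoff misses one novel read that received a high score while rejecting no known reads, which mirrors the conservative behavior (lower sensitivity, high specificity) reported for naïve Bayesian methods.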

CONCLUSIONS

We conclude that selecting a read-length-appropriate RDP bootstrap score can significantly reduce the search space for identifying novel genera and novel taxa at higher levels of taxonomy. In addition, having a well-represented database significantly improves performance, while having genera that are "highly" similar does not make a significant improvement. On a real dataset from an Amazon Terra Preta soil sample, we show that the detector can predict (or correlates with) whether novel sequences will be assigned to new taxa when the RDP database "doubles" in the future.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b3a0/3293824/5bcf3b9c1e2a/pone.0032491.g001.jpg
