Classifying short genomic fragments from novel lineages using composition and homology.

Author affiliation

Faculty of Computer Science, Dalhousie University, 6050 University Avenue, Halifax, Nova Scotia, Canada.

Publication information

BMC Bioinformatics. 2011 Aug 9;12:328. doi: 10.1186/1471-2105-12-328.

Abstract

BACKGROUND

The assignment of taxonomic attributions to DNA fragments recovered directly from the environment is a vital step in metagenomic data analysis. Assignments can be made using rank-specific classifiers, which assign reads to taxonomic labels from a predetermined level such as named species or strain, or rank-flexible classifiers, which choose an appropriate taxonomic rank for each sequence in a data set. The choice of rank typically depends on the optimal model for a given sequence and on the breadth of taxonomic groups seen in a set of close-to-optimal models. Homology-based (e.g., LCA) and composition-based (e.g., PhyloPythia, TACOA) rank-flexible classifiers have been proposed, but there is at present no hybrid approach that utilizes both homology and composition.
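To make the rank-flexible idea concrete, the following is a minimal sketch of a lowest common ancestor (LCA) assignment over a set of close-to-optimal models. It assumes each candidate reference genome is described by a root-to-leaf lineage; the function name and the example lineages are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def lowest_common_ancestor(lineages: Sequence[Sequence[str]]) -> List[str]:
    """Return the longest prefix shared by all root-to-leaf lineages."""
    if not lineages:
        return []
    shared: List[str] = []
    # zip stops at the shortest lineage, so ragged inputs are handled.
    for ranks in zip(*lineages):
        if all(rank == ranks[0] for rank in ranks):
            shared.append(ranks[0])
        else:
            break
    return shared


# Hypothetical lineages of the genomes behind three near-optimal hits.
hits = [
    ["Bacteria", "Proteobacteria", "Gammaproteobacteria", "Escherichia coli"],
    ["Bacteria", "Proteobacteria", "Gammaproteobacteria", "Salmonella enterica"],
    ["Bacteria", "Proteobacteria", "Betaproteobacteria", "Neisseria meningitidis"],
]

# The hits disagree below the phylum, so the fragment is assigned at that rank.
assignment = lowest_common_ancestor(hits)
print(assignment)
```

The broader the set of taxa among the near-optimal models, the higher the rank at which the fragment is labeled; a single hit yields the full lineage of that genome.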

RESULTS

We first develop a hybrid, rank-specific classifier based on BLAST and Naïve Bayes (NB) that has accuracy comparable to, and a faster running time than, the current best approach, PhymmBL. By substituting LCA for BLAST or allowing the inclusion of suboptimal NB models, we obtain a rank-flexible classifier. This hybrid classifier outperforms established rank-flexible approaches on simulated metagenomic fragments of length 200 bp to 1000 bp and is able to assign taxonomic attributions to a subset of sequences with few misclassifications. We then demonstrate the performance of different classifiers on an enhanced biological phosphorus removal metagenome, illustrating the advantages of rank-flexible classifiers when representative genomes are absent from the set of reference genomes. Application to a glacier ice metagenome demonstrates that similar taxonomic profiles are obtained across a set of classifiers that are increasingly conservative in their classification.

CONCLUSIONS

Our NB-based classification scheme is faster than the current best composition-based algorithm, Phymm, while providing equally accurate predictions. The rank-flexible variant of NB, which we term ε-NB, is complementary to LCA and can be combined with it to yield conservative prediction sets of very high confidence. The simple parameterization of LCA and ε-NB allows for tuning of the balance between more predictions and increased precision, allowing the user to account for the sensitivity of downstream analyses to misclassified or unclassified sequences.
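The trade-off between more predictions and higher precision can be illustrated with a small sketch of an ε-style selection step. The abstract does not define ε precisely, so this example assumes it is a tolerance on the log-likelihood gap to the best NB model; the function names, scores, and lineages are all hypothetical.

```python
from typing import Dict, List, Sequence


def common_prefix(lineages: Sequence[Sequence[str]]) -> List[str]:
    """Longest prefix shared by all root-to-leaf lineages."""
    shared: List[str] = []
    for ranks in zip(*lineages):
        if all(rank == ranks[0] for rank in ranks):
            shared.append(ranks[0])
        else:
            break
    return shared


def epsilon_nb(
    log_likelihoods: Dict[str, float],
    lineage_of: Dict[str, List[str]],
    epsilon: float,
) -> List[str]:
    """Keep every model within epsilon of the best score (assumed
    definition of epsilon), then label the fragment with the lowest
    common ancestor of the kept models."""
    best = max(log_likelihoods.values())
    kept = [g for g, score in log_likelihoods.items() if best - score <= epsilon]
    return common_prefix([lineage_of[g] for g in kept])


# Hypothetical NB log-likelihoods of one fragment under three genome models.
scores = {"genome_a": -100.0, "genome_b": -101.5, "genome_c": -110.0}
lineages = {
    "genome_a": ["Bacteria", "Proteobacteria", "Gammaproteobacteria", "Escherichia"],
    "genome_b": ["Bacteria", "Proteobacteria", "Gammaproteobacteria", "Salmonella"],
    "genome_c": ["Bacteria", "Firmicutes", "Bacilli", "Bacillus"],
}

strict = epsilon_nb(scores, lineages, epsilon=0.0)     # best model only
moderate = epsilon_nb(scores, lineages, epsilon=2.0)   # two closest models
cautious = epsilon_nb(scores, lineages, epsilon=20.0)  # all three models
```

Under this assumed definition, raising ε admits more suboptimal models, so the label moves toward the root: the prediction becomes less specific but is less likely to be wrong.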

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/85d8/3173459/71a7ec297e98/1471-2105-12-328-1.jpg
