Clustering metagenomic sequences with interpolated Markov models.

Affiliation

Center for Bioinformatics and Computational Biology, Institute for Advanced Computer Studies, College Park, MD 20742, USA.

Publication information

BMC Bioinformatics. 2010 Nov 2;11:544. doi: 10.1186/1471-2105-11-544.

Abstract

BACKGROUND

Sequencing of environmental DNA (often called metagenomics) has shown tremendous potential to uncover the vast number of unknown microbes that cannot be cultured and sequenced by traditional methods. Because the output of metagenomic sequencing is a large set of reads of unknown origin, clustering together the reads that were sequenced from the same species is a crucial analysis step. Many effective approaches to this task rely on sequenced genomes in public databases, but these genomes are a highly biased sample that is not necessarily representative of the environments of interest to many metagenomics projects.

RESULTS

We present SCIMM (Sequence Clustering with Interpolated Markov Models), an unsupervised sequence clustering method. SCIMM achieves greater clustering accuracy than previous unsupervised approaches. We examine the limitations of unsupervised learning on complex datasets and suggest PHYSCIMM, a hybrid of SCIMM and the supervised learning method Phymm, which performs better when evolutionarily close training genomes are available.
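The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea of clustering reads with Markov models, not SCIMM's implementation. It assumes a k-means-style loop (train one model per cluster, then reassign each read to the model that scores it highest) and substitutes a simple fixed-order Markov chain with pseudocounts for a true interpolated Markov model. All function names and parameters (`train_markov`, `score`, `cluster_reads`, `order`, `pseudocount`) are hypothetical.

```python
import math
import random
from collections import defaultdict

ALPHABET = "ACGT"

def train_markov(reads, order=2, pseudocount=1.0):
    """Fit a fixed-order Markov chain (a stand-in for an IMM, by assumption).

    Returns a function giving log P(base | previous `order` bases).
    """
    counts = defaultdict(lambda: defaultdict(float))
    for read in reads:
        for i in range(order, len(read)):
            counts[read[i - order:i]][read[i]] += 1.0

    def log_prob(context, base):
        ctx = counts.get(context, {})  # unseen contexts fall back to uniform
        total = sum(ctx.values()) + pseudocount * len(ALPHABET)
        return math.log((ctx.get(base, 0.0) + pseudocount) / total)

    return log_prob

def score(read, log_prob, order=2):
    """Log-likelihood of a read under one cluster's model."""
    return sum(log_prob(read[i - order:i], read[i])
               for i in range(order, len(read)))

def cluster_reads(reads, k=2, order=2, iterations=10, seed=0):
    """Alternate between training per-cluster models and reassigning reads."""
    rng = random.Random(seed)
    labels = [rng.randrange(k) for _ in reads]  # random start (an assumption)
    for _ in range(iterations):
        models = [
            train_markov([r for r, l in zip(reads, labels) if l == c], order)
            for c in range(k)
        ]
        new_labels = [
            max(range(k), key=lambda c: score(r, models[c], order))
            for r in reads
        ]
        if new_labels == labels:  # assignments stable: stop early
            break
        labels = new_labels
    return labels
```

Calling `cluster_reads(reads, k=2)` on a list of DNA strings returns one integer cluster label per read. The random starting assignment is chosen here only for brevity; the quality of such a loop depends heavily on initialization and on the choice of `k`.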

CONCLUSIONS

SCIMM and PHYSCIMM are highly accurate methods for clustering metagenomic sequences. SCIMM operates entirely unsupervised, making it well suited to environments containing mostly novel microbes. PHYSCIMM uses supervised learning to improve clustering in environments containing microbial strains from well-characterized genera. SCIMM and PHYSCIMM are available as open source from http://www.cbcb.umd.edu/software/scimm.
