Center for Bioinformatics and Computational Biology, Institute for Advanced Computer Studies, College Park, MD 20742, USA.
BMC Bioinformatics. 2010 Nov 2;11:544. doi: 10.1186/1471-2105-11-544.
BACKGROUND: Sequencing of environmental DNA (often called metagenomics) has shown tremendous potential to uncover the vast number of unknown microbes that cannot be cultured and sequenced by traditional methods. Because the output of metagenomic sequencing is a large set of reads of unknown origin, clustering together the reads that were sequenced from the same species is a crucial analysis step. Many effective approaches to this task rely on sequenced genomes in public databases, but these genomes are a highly biased sample that is not necessarily representative of the environments of interest to many metagenomics projects.
RESULTS: We present SCIMM (Sequence Clustering with Interpolated Markov Models), an unsupervised sequence clustering method. SCIMM achieves greater clustering accuracy than previous unsupervised approaches. We examine the limitations of unsupervised learning on complex datasets and suggest PHYSCIMM, a hybrid of SCIMM and the supervised learning method Phymm, which performs better when evolutionarily close training genomes are available.
CONCLUSIONS: SCIMM and PHYSCIMM are highly accurate methods for clustering metagenomic sequences. SCIMM operates entirely unsupervised, making it ideal for environments containing mostly novel microbes. PHYSCIMM uses supervised learning to improve clustering in environments containing microbial strains from well-characterized genera. SCIMM and PHYSCIMM are available as open source from http://www.cbcb.umd.edu/software/scimm.