Department of Computer Science, The University of Hong Kong, Hong Kong.
Bioinformatics. 2011 Jun 1;27(11):1489-95. doi: 10.1093/bioinformatics/btr186. Epub 2011 Apr 14.
MOTIVATION: With the rapid development of next-generation sequencing techniques, metagenomics, also known as environmental genomics, has emerged as an exciting research area that enables us to analyze the microbial environment in which we live. An important step for metagenomic data analysis is the identification and taxonomic characterization of DNA fragments (reads or contigs) resulting from sequencing a sample of mixed species. This step is referred to as 'binning'. Binning algorithms that are based on sequence similarity and sequence composition markers rely heavily on the reference genomes of known microorganisms or phylogenetic markers. Due to the limited availability of reference genomes and the bias and low availability of markers, these algorithms may not be applicable in all cases. Unsupervised binning algorithms which can handle fragments from unknown species provide an alternative approach. However, existing unsupervised binning algorithms only work on datasets either with balanced species abundance ratios or rather different abundance ratios, but not both.
RESULTS: In this article, we present MetaCluster 3.0, an integrated binning method based on the unsupervised top-down separation and bottom-up merging strategy, which can bin metagenomic fragments of species with very balanced abundance ratios (say 1:1) to very different abundance ratios (e.g. 1:24) with consistently higher accuracy than existing methods.
AVAILABILITY: MetaCluster 3.0 can be downloaded at http://i.cs.hku.hk/~alse/MetaCluster/.