Zhang Ruichang, Cheng Zhanzhan, Guan Jihong, Zhou Shuigeng
BMC Bioinformatics. 2015;16 Suppl 5(Suppl 5):S2. doi: 10.1186/1471-2105-16-S5-S2. Epub 2015 Mar 18.
With the rapid development of high-throughput technologies, researchers can sequence the whole metagenome of a microbial community sampled directly from the environment. Assigning these metagenomic reads to different species or taxonomic classes, a task referred to as binning of metagenomic data, is a vital step in metagenomic analysis.
In this paper, we propose a new method, TM-MCluster, for binning metagenomic reads. First, we represent each metagenomic read as a set of "k-mers" together with the frequency with which each k-mer occurs in the read. Then, we apply a probabilistic topic model, the Latent Dirichlet Allocation (LDA) model, to the reads; it generates a number of hidden "topics" such that each read can be represented by a distribution vector over the generated topics. Finally, as in the MCluster method, we apply SKWIC, a variant of the classical K-means algorithm with an automatic feature-weighting mechanism, to cluster the reads represented by their topic distributions.
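The first step above, representing a read by its k-mer frequencies, can be illustrated with a minimal Python sketch. The function names `kmer_counts` and `kmer_vector`, the choice k = 4, and the skipping of windows that contain ambiguous bases are all assumptions made for illustration; the abstract does not specify them, and this is not the authors' implementation.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"


def kmer_counts(read: str, k: int = 4) -> Counter:
    """Count overlapping k-mers in a read (k = 4 is an illustrative assumption)."""
    read = read.upper()
    counts = Counter()
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        # Assumption: skip windows containing ambiguous bases such as N.
        if all(base in ALPHABET for base in kmer):
            counts[kmer] += 1
    return counts


def kmer_vector(read: str, k: int = 4) -> list:
    """Return counts over all 4**k possible k-mers in a fixed lexicographic order."""
    counts = kmer_counts(read, k)
    vocabulary = ["".join(p) for p in product(ALPHABET, repeat=k)]
    return [counts[kmer] for kmer in vocabulary]
```

In topic-modeling terms, each read plays the role of a "document" and each k-mer the role of a "word", so a matrix with one such count vector per read is the kind of input an LDA implementation would typically accept before the clustering step.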
Experiments show that the new method TM-MCluster outperforms major existing methods, including AbundanceBin, MetaCluster 3.0/5.0 and MCluster. This result indicates that the exploitation of topic modeling can effectively improve the binning performance of metagenomic reads.