College of Computer and Information Science, Southwest University, Chongqing, China.
School of Life Sciences and Partner State Key Laboratory of Agrobiotechnology, The Chinese University of Hong Kong, Shatin, Hong Kong SAR, China.
Bioinformatics. 2018 Dec 15;34(24):4172-4179. doi: 10.1093/bioinformatics/bty519.
Metagenomics investigates DNA sequences recovered directly from environmental samples. Analysis often starts with read assembly, which yields contigs rather than complete genomes. Contig binning methods are therefore applied afterwards to group contigs into genome bins. Although several clustering-based binning methods have been developed, they generally suffer from problems of stability and robustness.
We introduce BMC3C, an ensemble clustering-based method that bins contigs accurately and robustly using DNA sequence Composition, Coverage across multiple samples and Codon usage. BMC3C first searches for an appropriate number of clusters and then repeatedly applies k-means clustering with different initializations to cluster the contigs. Next, it derives a weighted graph from these clusterings, in which each node represents a contig: the weight between two contigs is high if they are frequently grouped into the same cluster and low otherwise. Finally, BMC3C applies a graph partitioning technique to split the weighted graph into subgraphs, each corresponding to a genome bin. We evaluate BMC3C on both simulated and real-world datasets and compare it with state-of-the-art binning tools. BMC3C shows improved performance over these tools. To our knowledge, this is the first use of codon usage features and ensemble clustering in metagenomic contig binning.
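The ensemble idea described above (repeated k-means runs, a co-association weight between every pair of contigs, then a partition of the resulting graph) can be illustrated with a small pure-Python sketch. This is not the BMC3C implementation. The function names `kmeans`, `co_association` and `partition` are hypothetical, the feature vectors are made up, the number of clusters is fixed at 2 rather than searched for, and the abstract does not name the graph partitioning technique, so the sketch substitutes a simple stand-in: connected components over edges whose weight exceeds a threshold.

```python
import random

def kmeans(points, k, rng, iters=20):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    centers = rng.sample(points, k)  # random initialization differs per run
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest center (squared Euclidean distance).
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Move each center to the mean of its members; keep it if the cluster is empty.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels

def co_association(points, k, runs, seed=0):
    """weights[i][j] = fraction of k-means runs that put i and j in the same cluster."""
    rng = random.Random(seed)
    n = len(points)
    counts = [[0] * n for _ in range(n)]
    for _ in range(runs):
        labels = kmeans(points, k, rng)
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[count / runs for count in row] for row in counts]

def partition(weights, threshold=0.5):
    """Stand-in partitioning (an assumption, not the paper's technique):
    connected components over edges with weight above the threshold."""
    n = len(weights)
    bins = [-1] * n
    next_bin = 0
    for start in range(n):
        if bins[start] != -1:
            continue
        bins[start] = next_bin
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if bins[j] == -1 and weights[i][j] > threshold:
                    bins[j] = next_bin
                    stack.append(j)
        next_bin += 1
    return bins

# Made-up feature vectors for six contigs forming two well-separated groups.
contigs = [
    (0.30, 10.0), (0.32, 10.5), (0.31, 11.0),
    (0.60, 50.0), (0.62, 50.5), (0.61, 51.0),
]
weights = co_association(contigs, k=2, runs=10)
bins = partition(weights)
print(bins)  # [0, 0, 0, 1, 1, 1]
```

On this toy input every run recovers the same two groups, so all within-group weights are 1.0 and all cross-group weights are 0.0. On real data individual runs disagree, and the intermediate weights are what let the partitioning step smooth over unstable single clusterings.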
The BMC3C code is available at http://mlda.swu.edu.cn/codes.php?name=BMC3C.
Supplementary data are available at Bioinformatics online.