Key Laboratory of Information Fusion Technology of Ministry of Education, School of Automation, Northwestern Polytechnical University, Xi'an, 710072, China.
Mol Inform. 2017 Dec;36(12). doi: 10.1002/minf.201600059. Epub 2017 Jun 6.
Clustering 16S rRNA sequences into operational taxonomic units (OTUs) is a crucial step in analyzing metagenomic data. Although many methods have been developed, striking an appropriate balance between clustering accuracy and computational efficiency remains a major challenge. This paper proposes a novel density-based modularity clustering method, DMclust, that bins 16S rRNA sequences into OTUs with high clustering accuracy. The DMclust algorithm consists of four main phases. First, it searches for dense groups of sequences, defined as n-sequence communities, in which the distance between any two sequences is less than a threshold. Second, these dense groups are used to construct a weighted network in which each dense group is a node, every pair of dense groups is connected by an edge, and the distance between the two groups is the weight of that edge. Third, a modularity-based community detection method is applied to the network to generate preclusters. Finally, the remaining sequences are assigned to their nearest preclusters to form OTUs. Experimental results on several metagenomic datasets show that, compared with widely used existing methods, DMclust achieves higher clustering accuracy with acceptable memory usage.
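The four phases can be illustrated with a minimal Python sketch. This is not the DMclust implementation: the abstract does not specify the sequence distance, the group-to-group distance, the particular modularity-based method, or how "nearest" is measured. The sketch assumes a toy mismatch-fraction distance on equal-length sequences, average linkage between groups, edge weights converted from distance to similarity as 1 - distance (so that modularity rewards close groups), and a naive greedy agglomerative modularity merge. All function names and the toy data are hypothetical.

```python
from itertools import combinations

def distance(a, b):
    """Toy distance (assumption): fraction of mismatched positions, equal lengths."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def group_distance(seqs, g, h):
    """Average-linkage distance between two index lists (assumption)."""
    return sum(distance(seqs[i], seqs[j]) for i in g for j in h) / (len(g) * len(h))

def find_dense_groups(seqs, threshold, n):
    """Phase 1: greedily grow groups of at least n sequences whose pairwise
    distances are all below the threshold."""
    assigned, groups = set(), []
    for seed in range(len(seqs)):
        if seed in assigned:
            continue
        group = [seed]
        for j in range(len(seqs)):
            if j == seed or j in assigned:
                continue
            if all(distance(seqs[j], seqs[k]) < threshold for k in group):
                group.append(j)
        if len(group) >= n:
            groups.append(group)
            assigned.update(group)
    return groups

def build_network(seqs, groups):
    """Phase 2: complete weighted network over dense groups.
    Weight = 1 - average distance (similarity; an assumption of this sketch)."""
    return {(a, b): 1.0 - group_distance(seqs, groups[a], groups[b])
            for a, b in combinations(range(len(groups)), 2)}

def modularity_merge(num_nodes, weights):
    """Phase 3 stand-in: merge the pair of communities with the largest
    positive modularity gain until no merge improves modularity."""
    comms = [{i} for i in range(num_nodes)]
    two_m = 2 * sum(weights.values())
    if two_m == 0:
        return comms
    degree = [0.0] * num_nodes
    for (i, j), w in weights.items():
        degree[i] += w
        degree[j] += w

    def gain(a, b):
        between = sum(weights[min(i, j), max(i, j)] for i in a for j in b)
        deg_a = sum(degree[i] for i in a)
        deg_b = sum(degree[j] for j in b)
        return 2 * (between / two_m - deg_a * deg_b / two_m ** 2)

    while len(comms) > 1:
        best_gain, a, b = max(
            ((gain(x, y), x, y) for x, y in combinations(comms, 2)),
            key=lambda t: t[0],
        )
        if best_gain <= 0:
            break
        comms.remove(a)
        comms.remove(b)
        comms.append(a | b)
    return comms

def dmclust_sketch(seqs, threshold, n):
    """Run the four phases and return OTUs as lists of sequence indices."""
    groups = find_dense_groups(seqs, threshold, n)
    comms = modularity_merge(len(groups), build_network(seqs, groups))
    preclusters = [[i for g in sorted(c) for i in groups[g]] for c in comms]
    if not preclusters:  # no dense group found: fall back to singletons
        return [[i] for i in range(len(seqs))]
    otus = [list(p) for p in preclusters]
    placed = {i for p in preclusters for i in p}
    # Phase 4: assign each remaining sequence to its nearest precluster.
    for i in range(len(seqs)):
        if i not in placed:
            k = min(range(len(preclusters)),
                    key=lambda k: group_distance(seqs, [i], preclusters[k]))
            otus[k].append(i)
    return otus

# Hypothetical toy data: two families of length-10 sequences.
SEQS = [
    "AAAAAAAAAA", "AAAAAAAAAC", "CCAAAAAAAA", "CCAAAAAAAT",
    "GGGGGGGGGG", "GGGGGGGGGT", "TTGGGGGGGG", "TTGGGGGGGC",
    "AAAAAAACCA", "GGGGGGGTTG",
]

if __name__ == "__main__":
    for otu in dmclust_sketch(SEQS, threshold=0.15, n=2):
        print(sorted(otu))
```

On this toy input, phase 1 finds four two-sequence dense groups, phase 3 merges them into one precluster per family, and phase 4 attaches the two leftover sequences. Real 16S rRNA data would need an alignment-based or k-mer-based distance in place of the mismatch fraction used here.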