Interdisciplinary Center for Biotechnology Research University of Florida, Gainesville, FL 32610, USA.
Nucleic Acids Res. 2011 Aug;39(14):e95. doi: 10.1093/nar/gkr349. Epub 2011 May 19.
Taxonomy-independent analysis plays an essential role in microbial community analysis. Hierarchical clustering is one of the most widely employed approaches to finding operational taxonomic units, the basis for many downstream analyses. Most existing algorithms have quadratic space and computational complexities, and thus can be used only for small or medium-scale problems. We propose a new online learning-based algorithm that simultaneously addresses the space and computational issues of prior work. The basic idea is to partition a sequence space into a set of subspaces using a partition tree constructed using a pseudometric, then recursively refine a clustering structure in these subspaces. The technique relies on new methods for fast closest-pair searching and efficient dynamic insertion and deletion of tree nodes. To avoid exhaustive computation of pairwise distances between clusters, we represent each cluster of sequences as a probabilistic sequence, and define a set of operations to align these probabilistic sequences and compute genetic distances between them. We present analyses of space and computational complexity, and demonstrate the effectiveness of our new algorithm using a human gut microbiota data set with over one million sequences. The new algorithm exhibits a quasilinear time and space complexity comparable to greedy heuristic clustering algorithms, while achieving a similar accuracy to the standard hierarchical clustering algorithm.
非分类学分析在微生物群落分析中起着至关重要的作用。层次聚类是发现操作分类单元(许多下游分析的基础)的最广泛使用的方法之一。大多数现有算法具有二次空间和计算复杂度,因此只能用于小或中等规模的问题。我们提出了一种新的基于在线学习的算法,同时解决了先前工作的空间和计算问题。基本思想是使用基于伪度量的分区树将序列空间划分为一组子空间,然后在这些子空间中递归地细化聚类结构。该技术依赖于快速最近对搜索和有效动态插入和删除树节点的新方法。为避免在聚类之间对成对距离进行穷举计算,我们将每个序列聚类表示为概率序列,并定义了一组操作来对齐这些概率序列并计算它们之间的遗传距离。我们分析了空间和计算复杂度,并使用超过一百万条序列的人类肠道微生物组数据集证明了我们新算法的有效性。新算法表现出与贪婪启发式聚类算法相当的准线性时间和空间复杂度,同时达到与标准层次聚类算法相似的准确性。