Department of Computer Science, University of Maryland, College Park, MD 20742, USA.
Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD 20742, USA.
Nucleic Acids Res. 2023 May 8;51(8):e46. doi: 10.1093/nar/gkad158.
16S rRNA gene sequence clustering is an important tool in characterizing the diversity of microbial communities. As 16S rRNA gene data sets grow in size, existing sequence clustering algorithms are increasingly becoming an analytical bottleneck. Part of this bottleneck is due to the substantial computational cost expended on small clusters and singleton sequences. We propose an iterative sampling-based 16S rRNA gene sequence clustering approach that targets the largest clusters in the data set, allowing users to stop the clustering process when sufficient clusters are available for the specific analysis being targeted. We describe a probabilistic analysis of the iterative clustering process that supports the intuition that the clustering process identifies the larger clusters in the data set first. Using real data sets of 16S rRNA gene sequences, we show that the iterative algorithm, coupled with an adaptive sampling process and a mode-shifting strategy for identifying cluster representatives, substantially speeds up the clustering process while being effective at capturing the large clusters in the data set. The experiments also show that SCRAPT (Sample, Cluster, Recruit, AdaPt and iTerate) is able to produce operational taxonomic units that are less fragmented than those produced by the popular tools UCLUST, CD-HIT and DNACLUST. The algorithm is implemented in the open-source package SCRAPT. The source code used to generate the results presented in this paper is available at https://github.com/hsmurali/SCRAPT.
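The sample, cluster, recruit and iterate steps named in the abstract can be illustrated with a minimal toy sketch. This is not the SCRAPT implementation: the function names (`identity`, `iterative_sample_cluster`), the position-wise identity measure on equal-length strings, the fixed sampling fraction and the greedy clustering of the sample are all assumptions chosen for brevity, and the adaptive sampling and mode-shifting steps are omitted. The sketch only shows why random sampling tends to hit large clusters first: a large cluster contributes more sequences to the pool, so it is more likely to supply a sampled representative.

```python
import random


def identity(a, b):
    """Fraction of matching positions (assumes non-empty, pre-aligned sequences)."""
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))


def iterative_sample_cluster(seqs, threshold=0.9, sample_frac=0.1,
                             max_iters=20, seed=0):
    """Toy sample/cluster/recruit loop; returns [(representative, members), ...].

    All parameters are hypothetical illustration knobs, not SCRAPT options.
    """
    rng = random.Random(seed)
    remaining = list(range(len(seqs)))  # indices of still-unclustered sequences
    clusters = []
    for _ in range(max_iters):          # the caller can stop early via max_iters
        if not remaining:
            break
        # Sample: draw a small random subset of the unclustered sequences.
        k = max(1, int(sample_frac * len(remaining)))
        sample = rng.sample(remaining, k)
        # Cluster: greedily pick representatives that are mutually dissimilar.
        reps = []
        for i in sample:
            if not any(identity(seqs[i], seqs[r]) >= threshold for r in reps):
                reps.append(i)
        # Recruit: assign every unclustered sequence to the first matching
        # representative; sequences matching none stay in the pool.
        members = {r: [] for r in reps}
        unassigned = []
        for i in remaining:
            for r in reps:
                if identity(seqs[i], seqs[r]) >= threshold:
                    members[r].append(i)
                    break
            else:
                unassigned.append(i)
        clusters.extend((r, members[r]) for r in reps)
        # Iterate: repeat on whatever is still unclustered.
        remaining = unassigned
    return clusters


# One large cluster of identical reads plus four unrelated singletons.
toy = ["ACGTACGTAC"] * 40 + ["AAAAAAAAAA", "CCCCCCCCCC",
                             "GGGGGGGGGG", "TTTTTTTTTT"]
result = iterative_sample_cluster(toy)
print(sorted((len(m) for _, m in result), reverse=True))  # [40, 1, 1, 1, 1]
```

Lowering `max_iters` in this sketch mimics stopping once enough clusters are available: sequences that were never sampled or recruited are simply left unclustered.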