SCRAPT: an iterative algorithm for clustering large 16S rRNA gene data sets.

Author information

Department of Computer Science, University of Maryland, College Park, MD 20742, USA.

Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD 20742, USA.

Publication information

Nucleic Acids Res. 2023 May 8;51(8):e46. doi: 10.1093/nar/gkad158.

Abstract

16S rRNA gene sequence clustering is an important tool in characterizing the diversity of microbial communities. As 16S rRNA gene data sets are growing in size, existing sequence clustering algorithms increasingly become an analytical bottleneck. Part of this bottleneck is due to the substantial computational cost expended on small clusters and singleton sequences. We propose an iterative sampling-based 16S rRNA gene sequence clustering approach that targets the largest clusters in the data set, allowing users to stop the clustering process when sufficient clusters are available for the specific analysis being targeted. We describe a probabilistic analysis of the iterative clustering process that supports the intuition that the clustering process identifies the larger clusters in the data set first. Using real data sets of 16S rRNA gene sequences, we show that the iterative algorithm, coupled with an adaptive sampling process and a mode-shifting strategy for identifying cluster representatives, substantially speeds up the clustering process while being effective at capturing the large clusters in the data set. The experiments also show that SCRAPT (Sample, Cluster, Recruit, AdaPt and iTerate) is able to produce operational taxonomic units that are less fragmented than those produced by the popular tools UCLUST, CD-HIT and DNACLUST. The algorithm is implemented in the open-source package SCRAPT. The source code used to generate the results presented in this paper is available at https://github.com/hsmurali/SCRAPT.
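The sample, cluster, recruit, adapt and iterate loop described in the abstract can be illustrated with a toy sketch. This is not the SCRAPT implementation: every function name, the use of plain edit distance with a fixed cutoff, the rule that clusters smaller than `min_size` go back into the pool, and the rule that the sampling rate doubles after an unproductive iteration are assumptions made only for illustration, and the mode-shifting step is not modeled. The helper `hit_probability` assumes each sequence is sampled independently with the same probability, which is one simple way to see why larger clusters tend to be found first.

```python
import random


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def greedy_centers(sample, max_dist):
    """Cluster the sample greedily: a sequence becomes a new center
    unless it lies within max_dist of a center chosen earlier."""
    centers = []
    for s in sample:
        if not any(edit_distance(s, c) <= max_dist for c in centers):
            centers.append(s)
    return centers


def iterative_cluster(seqs, max_dist=1, rate=0.1, min_size=2, max_iters=20, seed=0):
    """Hypothetical sample/cluster/recruit/adapt/iterate loop (illustrative only)."""
    rng = random.Random(seed)
    remaining = list(seqs)
    clusters = []  # list of (center, members) pairs
    for _ in range(max_iters):
        if not remaining:
            break
        # Sample: draw a fraction of the sequences that are still unclustered.
        k = max(1, round(rate * len(remaining)))
        centers = greedy_centers(rng.sample(remaining, k), max_dist)
        # Recruit: assign every remaining sequence to the first center within range.
        members = {c: [] for c in centers}
        unassigned = []
        for s in remaining:
            for c in centers:
                if edit_distance(s, c) <= max_dist:
                    members[c].append(s)
                    break
            else:
                unassigned.append(s)
        # Keep clusters that reached min_size; return the rest to the pool (assumption).
        found = False
        for c, group in members.items():
            if len(group) >= min_size:
                clusters.append((c, group))
                found = True
            else:
                unassigned.extend(group)
        remaining = unassigned
        # Adapt: placeholder rule, double the sampling rate after an unproductive pass.
        if not found:
            if rate >= 1.0:
                break  # everything was sampled and nothing clustered; stop
            rate = min(1.0, rate * 2)
    return clusters, remaining


def hit_probability(cluster_size: int, rate: float) -> float:
    """Chance that at least one member of a cluster is sampled, assuming each
    sequence is drawn independently with probability `rate`."""
    return 1 - (1 - rate) ** cluster_size


if __name__ == "__main__":
    data = ["ACGTACGT"] * 20 + ["TTTTCCCC"] * 5 + ["GGGGGGGG"]
    found, leftover = iterative_cluster(data, rate=0.2)
    for center, group in found:
        print(center, len(group))
    print("unclustered:", leftover)
```

Because the loop can be stopped after any iteration, a caller interested only in the dominant clusters could lower `max_iters` and accept a larger unclustered remainder; in this sketch that remainder is simply returned as the second value.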

Figure 1 (image): https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5be5/10164572/a16209d581c9/gkad158fig1.jpg
