SCRAPT: an iterative algorithm for clustering large 16S rRNA gene data sets.

Affiliations

Department of Computer Science, University of Maryland, College Park, MD 20742, USA.

Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD 20742, USA.

Publication information

Nucleic Acids Res. 2023 May 8;51(8):e46. doi: 10.1093/nar/gkad158.

PMID:36912074

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10164572/

Abstract

16S rRNA gene sequence clustering is an important tool in characterizing the diversity of microbial communities. As 16S rRNA gene data sets are growing in size, existing sequence clustering algorithms increasingly become an analytical bottleneck. Part of this bottleneck is due to the substantial computational cost expended on small clusters and singleton sequences. We propose an iterative sampling-based 16S rRNA gene sequence clustering approach that targets the largest clusters in the data set, allowing users to stop the clustering process when sufficient clusters are available for the specific analysis being targeted. We describe a probabilistic analysis of the iterative clustering process that supports the intuition that the clustering process identifies the larger clusters in the data set first. Using real data sets of 16S rRNA gene sequences, we show that the iterative algorithm, coupled with an adaptive sampling process and a mode-shifting strategy for identifying cluster representatives, substantially speeds up the clustering process while being effective at capturing the large clusters in the data set. The experiments also show that SCRAPT (Sample, Cluster, Recruit, AdaPt and iTerate) is able to produce operational taxonomic units that are less fragmented than popular tools: UCLUST, CD-HIT and DNACLUST. The algorithm is implemented in the open-source package SCRAPT. The source code used to generate the results presented in this paper is available at https://github.com/hsmurali/SCRAPT.
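The sample, cluster, recruit, adapt and iterate loop described in the abstract can be illustrated with a small toy sketch. This is not the SCRAPT implementation: the function names (`identity`, `greedy_cluster`, `iterative_cluster`), the position-wise identity measure for equal-length strings, the rule of doubling the sampling fraction when a sample yields no usable cluster, and all default parameter values are assumptions made for illustration only. The mode-shifting step for choosing cluster representatives is omitted.

```python
import random


def identity(a, b):
    """Fraction of matching positions; a toy stand-in for alignment-based identity."""
    if len(a) != len(b) or not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(seqs, threshold):
    """Greedy center-based clustering: each sequence joins the first center it matches."""
    clusters = []  # list of (center, members)
    for s in seqs:
        for center, members in clusters:
            if identity(center, s) >= threshold:
                members.append(s)
                break
        else:
            clusters.append((s, [s]))
    return clusters


def iterative_cluster(seqs, threshold=0.9, sample_frac=0.1,
                      min_size=2, max_iter=20, seed=0):
    """Return (clusters, unclustered); clusters is a list of (center, members)."""
    rng = random.Random(seed)
    remaining = list(seqs)
    found = []
    for _ in range(max_iter):
        if not remaining:
            break
        # Sample: draw a random subset of the sequences not yet clustered.
        k = max(1, int(sample_frac * len(remaining)))
        sample = rng.sample(remaining, k)
        # Cluster: cluster only the sample and keep centers of non-singleton
        # clusters, largest first (large clusters are the likeliest to be sampled).
        sample_clusters = sorted(greedy_cluster(sample, threshold),
                                 key=lambda cm: len(cm[1]), reverse=True)
        centers = [c for c, m in sample_clusters if len(m) >= min_size]
        if not centers:
            if k == len(remaining):
                break  # everything was sampled and only singletons are left
            # Adapt (assumed rule): enlarge the sample and try again.
            sample_frac = min(1.0, 2 * sample_frac)
            continue
        # Recruit: assign every remaining sequence to the first matching center.
        recruited = [[] for _ in centers]
        leftover = []
        for s in remaining:
            for i, c in enumerate(centers):
                if identity(c, s) >= threshold:
                    recruited[i].append(s)
                    break
            else:
                leftover.append(s)
        found.extend(zip(centers, recruited))
        remaining = leftover  # Iterate on what is still unclustered.
    return found, remaining
```

Because `max_iter` bounds the loop, a caller can stop early once enough large clusters have been collected, which mirrors the early-stopping use case the abstract mentions. Sequences left in the second return value are the small clusters and singletons that the loop never spent recruitment effort on.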
