ESPRIT-Tree: hierarchical clustering analysis of millions of 16S rRNA pyrosequences in quasilinear computational time.

Affiliation

Interdisciplinary Center for Biotechnology Research, University of Florida, Gainesville, FL 32610, USA.

Publication information

Nucleic Acids Res. 2011 Aug;39(14):e95. doi: 10.1093/nar/gkr349. Epub 2011 May 19.

DOI: 10.1093/nar/gkr349

PMID: 21596775

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3152367/

Abstract

Taxonomy-independent analysis plays an essential role in microbial community analysis. Hierarchical clustering is one of the most widely employed approaches to finding operational taxonomic units, the basis for many downstream analyses. Most existing algorithms have quadratic space and computational complexities, and thus can be used only for small or medium-scale problems. We propose a new online learning-based algorithm that simultaneously addresses the space and computational issues of prior work. The basic idea is to partition a sequence space into a set of subspaces using a partition tree constructed using a pseudometric, then recursively refine a clustering structure in these subspaces. The technique relies on new methods for fast closest-pair searching and efficient dynamic insertion and deletion of tree nodes. To avoid exhaustive computation of pairwise distances between clusters, we represent each cluster of sequences as a probabilistic sequence, and define a set of operations to align these probabilistic sequences and compute genetic distances between them. We present analyses of space and computational complexity, and demonstrate the effectiveness of our new algorithm using a human gut microbiota data set with over one million sequences. The new algorithm exhibits a quasilinear time and space complexity comparable to greedy heuristic clustering algorithms, while achieving a similar accuracy to the standard hierarchical clustering algorithm.
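The idea of summarizing a cluster as a probabilistic sequence can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the member sequences are already aligned to equal length, whereas the paper defines its own operations for aligning probabilistic sequences, and the names `profile` and `expected_mismatch`, the alphabet, and the distance definition are all hypothetical choices made for illustration.

```python
from collections import Counter

# Assumed alphabet: four nucleotides plus a gap symbol (illustrative choice).
ALPHABET = "ACGT-"


def profile(aligned_seqs):
    """Summarize pre-aligned, equal-length sequences as per-column symbol frequencies."""
    if not aligned_seqs:
        raise ValueError("need at least one sequence")
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("sequences must be pre-aligned to equal length")
    n = len(aligned_seqs)
    columns = []
    for i in range(length):
        counts = Counter(s[i] for s in aligned_seqs)
        columns.append({a: counts.get(a, 0) / n for a in ALPHABET})
    return columns


def expected_mismatch(p, q):
    """Average, over columns, of the probability that symbols drawn
    independently from the two profiles differ (a stand-in distance)."""
    if len(p) != len(q):
        raise ValueError("profiles must have the same number of columns")
    total = 0.0
    for col_p, col_q in zip(p, q):
        total += 1.0 - sum(col_p[a] * col_q[a] for a in ALPHABET)
    return total / len(p)


cluster_a = profile(["ACGT", "ACGA"])  # two members that differ in the last column
cluster_b = profile(["ACGT"])          # a single-member cluster
distance = expected_mismatch(cluster_a, cluster_b)
```

With a summary of this kind, comparing two clusters costs one pass over the columns rather than one comparison per pair of member sequences, which is the motivation the abstract gives for the representation.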

Figure (from the original article): https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8d0f/3152367/04c4079a19cb/gkr349f1.jpg
