A grammar-based distance metric enables fast and accurate clustering of large sets of 16S sequences.

Affiliation

Department of Electrical Engineering, University of Nebraska-Lincoln, 209N WSEC, Lincoln, NE 68588-0511, USA.

Publication information

BMC Bioinformatics. 2010 Dec 17;11:601. doi: 10.1186/1471-2105-11-601.

DOI:10.1186/1471-2105-11-601

PMID:21167044

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3022630/

Abstract

BACKGROUND

We propose a sequence clustering algorithm and compare the partition quality and execution time of the proposed algorithm with those of a popular existing algorithm. The proposed clustering algorithm uses a grammar-based distance metric to determine partitioning for a set of biological sequences. The algorithm performs clustering in which new sequences are compared with cluster-representative sequences to determine membership. If comparison fails to identify a suitable cluster, a new cluster is created.
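The greedy, representative-based procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the abstract does not define the grammar-based distance, so the sketch substitutes a simple Lempel-Ziv (LZ78-style) phrase count and one common normalized form of a complexity-based distance. The function names (`lz_complexity`, `grammar_distance`, `greedy_cluster`), the rule that the first sequence of a cluster serves as its representative, and the `threshold` parameter are all hypothetical and not taken from the paper.

```python
from typing import List


def lz_complexity(seq: str) -> int:
    """Count the phrases in an LZ78-style parse of seq (assumed stand-in for grammar size)."""
    phrases = set()
    current = ""
    count = 0
    for ch in seq:
        current += ch
        if current not in phrases:
            # A phrase not seen before closes here and enters the dictionary.
            phrases.add(current)
            count += 1
            current = ""
    if current:
        # A trailing, already-known phrase still counts as one phrase.
        count += 1
    return count


def grammar_distance(s: str, q: str) -> float:
    """Assumed distance: extra phrases needed to describe one sequence after the other, normalized."""
    cs, cq = lz_complexity(s), lz_complexity(q)
    if max(cs, cq) == 0:
        return 0.0
    extra = max(lz_complexity(s + q) - cs, lz_complexity(q + s) - cq)
    return extra / max(cs, cq)


def greedy_cluster(seqs: List[str], threshold: float) -> List[List[int]]:
    """Assign each sequence to the closest representative, or open a new cluster."""
    representatives: List[str] = []
    clusters: List[List[int]] = []
    for idx, seq in enumerate(seqs):
        best, best_d = None, threshold
        for c, rep in enumerate(representatives):
            d = grammar_distance(seq, rep)
            if d <= best_d:
                best, best_d = c, d
        if best is None:
            # No representative is close enough: this sequence founds a new cluster.
            representatives.append(seq)
            clusters.append([idx])
        else:
            clusters[best].append(idx)
    return clusters
```

A call such as `greedy_cluster(sequences, threshold=0.5)` returns lists of input indices, one list per cluster. The threshold value is arbitrary here and would need tuning for real 16S data. Because each new sequence is compared only with cluster representatives rather than with every other sequence, the number of distance computations grows with the number of clusters instead of the square of the dataset size.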

RESULTS

The performance of the proposed algorithm is validated by comparison with the popular DNA/RNA sequence clustering approach CD-HIT-EST and with the recently developed algorithm UCLUST, using two different sets of 16S rDNA sequences from 2,255 genera. The CPU execution time of the proposed algorithm is comparable to that of CD-HIT-EST, which is much slower than UCLUST, and the proposed algorithm generates clusters with higher statistical accuracy than both CD-HIT-EST and UCLUST. The validation results are especially striking for large datasets.

CONCLUSIONS

We introduce a fast and accurate clustering algorithm that relies on a grammar-based sequence distance. Its statistical clustering quality is validated by clustering large datasets containing 16S rDNA sequences.
