A grammar-based distance metric enables fast and accurate clustering of large sets of 16S sequences.

Affiliation

Department of Electrical Engineering, University of Nebraska-Lincoln, 209N WSEC, Lincoln, NE 68588-0511, USA.

Publication information

BMC Bioinformatics. 2010 Dec 17;11:601. doi: 10.1186/1471-2105-11-601.

Abstract

BACKGROUND

We propose a sequence clustering algorithm and compare the partition quality and execution time of the proposed algorithm with those of a popular existing algorithm. The proposed clustering algorithm uses a grammar-based distance metric to determine partitioning for a set of biological sequences. The algorithm performs clustering in which new sequences are compared with cluster-representative sequences to determine membership. If comparison fails to identify a suitable cluster, a new cluster is created.
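The greedy scheme described above (compare each new sequence with the cluster representatives, and open a new cluster when none is close enough) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not specify the grammar-based metric, so `grammar_distance` below is a stand-in loosely modeled on Lempel-Ziv phrase counting, and the names `lz_phrase_count`, `grammar_distance`, `greedy_cluster`, the `threshold` parameter, the best-match rule, and the choice of the first member as representative are all hypothetical.

```python
from typing import Callable, List


def lz_phrase_count(s: str) -> int:
    """Count phrases in a Lempel-Ziv style parse of s.

    Each phrase is the shortest substring starting at the current position
    that has not occurred earlier in the sequence. The count serves here as
    a rough proxy for grammar size (an assumption for illustration).
    This naive version is quadratic or worse and is not optimized.
    """
    i, count, n = 0, 0, len(s)
    while i < n:
        length = 1
        # Grow the phrase while it can still be copied from earlier text.
        while i + length <= n and s[i:i + length] in s[:i + length - 1]:
            length += 1
        count += 1
        i += length
    return count


def grammar_distance(a: str, b: str) -> float:
    """Stand-in distance: extra phrases needed to describe one sequence
    after the other, normalized by the larger individual phrase count."""
    ca, cb = lz_phrase_count(a), lz_phrase_count(b)
    if max(ca, cb) == 0:
        return 0.0  # both sequences are empty
    cab, cba = lz_phrase_count(a + b), lz_phrase_count(b + a)
    return max(cab - ca, cba - cb) / max(ca, cb)


def greedy_cluster(
    seqs: List[str],
    distance: Callable[[str, str], float],
    threshold: float,
) -> List[List[str]]:
    """Assign each sequence to the closest representative within threshold,
    or start a new cluster with the sequence as its representative."""
    representatives: List[str] = []
    clusters: List[List[str]] = []
    for seq in seqs:
        best_idx, best_d = None, threshold
        for idx, rep in enumerate(representatives):
            d = distance(seq, rep)
            if d <= best_d:
                best_idx, best_d = idx, d
        if best_idx is None:
            # No suitable cluster found: create a new one.
            representatives.append(seq)
            clusters.append([seq])
        else:
            clusters[best_idx].append(seq)
    return clusters
```

A call such as `greedy_cluster(sequences, grammar_distance, threshold=0.3)` returns a list of clusters, each a list of sequences whose first element is the representative. Because every sequence is compared only with representatives rather than with all other sequences, the number of distance computations grows with the number of clusters rather than with the square of the number of sequences.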

RESULTS

We validate the proposed algorithm against the popular DNA/RNA sequence clustering approach CD-HIT-EST and the more recently developed UCLUST, using two different sets of 16S rDNA sequences from 2,255 genera. Its CPU execution time is comparable to that of CD-HIT-EST, which is much slower than UCLUST, and it generates clusters with higher statistical accuracy than both CD-HIT-EST and UCLUST. The validation results are especially striking for large datasets.

CONCLUSIONS

We introduce a fast and accurate clustering algorithm that relies on a grammar-based sequence distance. Its statistical clustering quality is validated by clustering large datasets containing 16S rDNA sequences.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0cb9/3022630/13279b6b24be/1471-2105-11-601-1.jpg
