Department of Electrical Engineering, University of Nebraska-Lincoln, 209N WSEC, Lincoln, NE 68588-0511, USA.
BMC Bioinformatics. 2010 Dec 17;11:601. doi: 10.1186/1471-2105-11-601.
We propose a sequence clustering algorithm and compare its partition quality and execution time with those of a popular existing algorithm. The proposed algorithm uses a grammar-based distance metric to partition a set of biological sequences. It clusters incrementally: each new sequence is compared with cluster-representative sequences to determine membership, and if the comparison fails to identify a suitable cluster, a new cluster is created.
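The representative-based procedure described above can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the function names (`lz78_phrase_count`, `lz_distance`, `greedy_cluster`) and the `threshold` parameter are hypothetical, and the distance shown is a stand-in built from LZ78-style phrase counts, which may differ from the grammar-based metric used in the paper. The sketch also assumes that the first member of each cluster serves as its representative.

```python
from typing import List


def lz78_phrase_count(seq: str) -> int:
    """Count the phrases in an LZ78-style parse of seq (assumed stand-in)."""
    phrases = set()
    current = ""
    count = 0
    for ch in seq:
        current += ch
        if current not in phrases:
            # A phrase not seen before closes here and joins the dictionary.
            phrases.add(current)
            count += 1
            current = ""
    if current:
        # Leftover text that matches an existing phrase still counts once.
        count += 1
    return count


def lz_distance(a: str, b: str) -> float:
    """Stand-in distance: extra phrases needed to parse one sequence after
    the other, normalized by the larger individual phrase count."""
    ca, cb = lz78_phrase_count(a), lz78_phrase_count(b)
    if max(ca, cb) == 0:
        return 0.0  # both sequences are empty
    extra_ab = lz78_phrase_count(a + b) - ca
    extra_ba = lz78_phrase_count(b + a) - cb
    return max(extra_ab, extra_ba) / max(ca, cb)


def greedy_cluster(sequences: List[str], threshold: float) -> List[List[str]]:
    """Assign each sequence to the closest representative within threshold,
    or open a new cluster. The first member of a cluster is its
    representative (an assumption of this sketch)."""
    clusters: List[List[str]] = []
    for seq in sequences:
        best, best_d = None, None
        for cluster in clusters:
            d = lz_distance(seq, cluster[0])
            if d <= threshold and (best_d is None or d < best_d):
                best, best_d = cluster, d
        if best is None:
            clusters.append([seq])  # no suitable cluster: create a new one
        else:
            best.append(seq)
    return clusters
```

Calling `greedy_cluster(sequences, threshold)` with a small `threshold` tends to produce many small clusters, while a large one merges more sequences into each cluster. A suitable value would have to be tuned for the data and for whichever distance is actually used.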
The performance of the proposed algorithm is validated by comparison with the popular DNA/RNA sequence clustering approach CD-HIT-EST and with the recently developed algorithm UCLUST, using two different sets of 16S rDNA sequences from 2,255 genera. The CPU execution time of the proposed algorithm is comparable to that of CD-HIT-EST, which is much slower than UCLUST, and the proposed algorithm generates clusters with higher statistical accuracy than both CD-HIT-EST and UCLUST. The validation results are especially striking for large datasets.
We introduce a fast and accurate clustering algorithm that relies on a grammar-based sequence distance. Its statistical clustering quality is validated by clustering large datasets containing 16S rDNA sequences.