Computational space reduction and parallelization of a new clustering approach for large groups of sequences.

Author information

Trelles O, Andrade M A, Valencia A, Zapata E L, Carazo J M

Affiliation

Computer Architecture Department, University of Malaga, 29017 Malaga, Spain.

Publication information

Bioinformatics. 1998 Jun;14(5):439-51. doi: 10.1093/bioinformatics/14.5.439.

DOI:10.1093/bioinformatics/14.5.439

PMID:9682057

Abstract

MOTIVATION

The explosive growth of the biological sequences databases stimulated by genome projects has modified the framework of several applications in the biological sequence analysis area. In most cases, this new scenario is characterized by studies on large sets of sequences, suggesting the need for effective and automatic methods for their clustering. A more effective clustering of the database could be followed by the application of common family analysis schemes to the groups so formed.

RESULTS

In this work, we present a new strategy to reduce the computational cost associated with the clustering of large sets of sequences which are expected to contain several families. The strategy is based on the grouping of the sequences into families by using a dynamic threshold on a pairwise sequence similarity criterion. Routine clustering of large data sets can now be done very efficiently. The method developed here achieves a computational space reduction of about an order of magnitude over more traditional ones of all-versus-all comparisons. The outcome of this approach produces family groupings that reproduce closely already accepted biological results. Our work includes a parallel implementation for distributed memory multiprocessors with a dynamic scheduling strategy for performance optimization.
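The abstract does not spell out the algorithm, so the following is only a minimal sketch of how a threshold on pairwise similarity can avoid all-versus-all comparison. It assumes a greedy scheme in which each sequence is compared with a single representative per family. The `similarity` function is a toy stand-in for a real alignment score, and the rule that raises the threshold as a family grows is a hypothetical illustration of a "dynamic" threshold, not the authors' rule. All names and parameter values are assumptions.

```python
from typing import List, Tuple


def similarity(a: str, b: str) -> float:
    """Toy score: identical positions divided by the longer length.

    Hypothetical stand-in for a real pairwise alignment score.
    """
    if not a or not b:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def cluster_by_representative(
    seqs: List[str],
    base_threshold: float = 0.6,   # assumed starting threshold
    tighten: float = 0.02,         # assumed increase per extra family member
    max_threshold: float = 0.9,    # assumed upper bound
) -> Tuple[List[List[str]], int]:
    """Greedy clustering against one representative per family.

    Returns the families and the number of pairwise comparisons performed.
    """
    families: List[List[str]] = []  # the first member acts as representative
    comparisons = 0
    for s in seqs:
        best, best_score = None, 0.0
        for fam in families:
            comparisons += 1
            score = similarity(s, fam[0])
            # Assumed dynamic rule: larger families demand a higher score.
            threshold = min(max_threshold,
                            base_threshold + tighten * (len(fam) - 1))
            if score >= threshold and score > best_score:
                best, best_score = fam, score
        if best is None:
            families.append([s])    # no family accepts s: start a new one
        else:
            best.append(s)
    return families, comparisons


if __name__ == "__main__":
    demo = ["AAAAAAAA", "AAAAAAAT", "AAAAAATT",
            "CCCCCCCC", "CCCCCCCG", "GGGGTTTT"]
    fams, used = cluster_by_representative(demo)
    all_pairs = len(demo) * (len(demo) - 1) // 2
    print(f"{len(fams)} families, {used} comparisons (all-versus-all: {all_pairs})")
```

In this sketch the number of comparisons grows with the number of families rather than with the number of sequence pairs, which is the general kind of saving the abstract describes. The order-of-magnitude figure reported there refers to the authors' own method and data, not to this toy example.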

AVAILABILITY

By anonymous ftp at ftp.ac.uma.es (/pub/ots/pCluster directory), or from our Web site http://www.cnb.uam.es/www/software/software_index.html

CONTACT

ots@ac.uma.es
