Computational space reduction and parallelization of a new clustering approach for large groups of sequences.

Authors

Trelles O, Andrade M A, Valencia A, Zapata E L, Carazo J M

Affiliation

Computer Architecture Department, University of Malaga, 29017 Malaga, Spain.

Publication

Bioinformatics. 1998 Jun;14(5):439-51. doi: 10.1093/bioinformatics/14.5.439.

DOI: 10.1093/bioinformatics/14.5.439
PMID: 9682057
Abstract

MOTIVATION

The explosive growth of the biological sequences databases stimulated by genome projects has modified the framework of several applications in the biological sequence analysis area. In most cases, this new scenario is characterized by studies on large sets of sequences, suggesting the need for effective and automatic methods for their clustering. A more effective clustering of the database could be followed by the application of common family analysis schemes to the groups so formed.

RESULTS

In this work, we present a new strategy to reduce the computational cost associated with the clustering of large sets of sequences which are expected to contain several families. The strategy is based on the grouping of the sequences into families by using a dynamic threshold on a pairwise sequence similarity criterion. Routine clustering of large data sets can now be done very efficiently. The method developed here achieves a computational space reduction of about an order of magnitude over more traditional ones of all-versus-all comparisons. The outcome of this approach produces family groupings that reproduce closely already accepted biological results. Our work includes a parallel implementation for distributed memory multiprocessors with a dynamic scheduling strategy for performance optimization.
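The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea: if each new sequence is compared against one representative per family rather than against every other sequence, far fewer pairwise comparisons are needed than in an all-versus-all scheme. Everything here is an assumption made for illustration and is not taken from the paper: the toy `similarity` function, the names `cluster`, `base_threshold`, `tighten` and `max_threshold`, and in particular the rule that raises the threshold as a family grows. The parallel implementation and its dynamic scheduling are not covered.

```python
from typing import List, Tuple


def similarity(a: str, b: str) -> float:
    """Toy similarity (assumption): fraction of identical positions,
    normalised by the longer sequence. A real tool would use an
    alignment score instead."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def cluster(
    seqs: List[str],
    base_threshold: float = 0.6,
    tighten: float = 0.02,
    max_threshold: float = 0.9,
) -> Tuple[List[List[str]], int]:
    """Greedy grouping against one representative per family.

    The threshold rule below (rising slightly with family size) is a
    hypothetical stand-in for the paper's dynamic threshold.
    Returns the families and the number of pairwise comparisons made.
    """
    families: List[List[str]] = []  # first member acts as representative
    comparisons = 0
    for seq in seqs:
        placed = False
        for family in families:
            comparisons += 1
            threshold = min(
                max_threshold, base_threshold + tighten * (len(family) - 1)
            )
            if similarity(seq, family[0]) >= threshold:
                family.append(seq)
                placed = True
                break
        if not placed:
            families.append([seq])  # seq founds a new family
    return families, comparisons


demo_seqs = [
    "AAAAAAAA", "AAAAAAAT", "AAAAAATT",
    "CCCCCCCC", "CCCCCCCG", "GGGGGGGG",
]
demo_families, demo_comparisons = cluster(demo_seqs)
# Number of comparisons an all-versus-all scheme would need: n(n-1)/2
all_vs_all = len(demo_seqs) * (len(demo_seqs) - 1) // 2
print(len(demo_families), demo_comparisons, all_vs_all)
```

On this toy input the representative-based pass needs roughly half the comparisons of the all-versus-all count. The saving depends on how many families the data contain, since each sequence is compared at most once per existing family.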

AVAILABILITY

By anonymous ftp at ftp.ac.uma.es (/pub/ots/pCluster directory), or from our Web site http://www.cnb.uam.es/www/software/software_index.html

CONTACT

ots@ac.uma.es

Similar articles

1
Computational space reduction and parallelization of a new clustering approach for large groups of sequences.
Bioinformatics. 1998 Jun;14(5):439-51. doi: 10.1093/bioinformatics/14.5.439.
2
New phylogenetic venues opened by a novel implementation of the DNAml algorithm.
Bioinformatics. 1998;14(6):544-5. doi: 10.1093/bioinformatics/14.6.544.
3
DIVCLUS: an automatic method in the GEANFAMMER package that finds homologous domains in single- and multi-domain proteins.
Bioinformatics. 1998;14(2):144-50. doi: 10.1093/bioinformatics/14.2.144.
4
Automated protein sequence database classification. I. Integration of compositional similarity search, local similarity search, and multiple sequence alignment.
Bioinformatics. 1998;14(2):164-73. doi: 10.1093/bioinformatics/14.2.164.
5
A set-theoretic approach to database searching and clustering.
Bioinformatics. 1998 Jun;14(5):430-8. doi: 10.1093/bioinformatics/14.5.430.
6
Clustering protein sequences--structure prediction by transitive homology.
Bioinformatics. 2001 Oct;17(10):935-41. doi: 10.1093/bioinformatics/17.10.935.
7
Clustering of proximal sequence space for the identification of protein families.
Bioinformatics. 2002 Jul;18(7):908-21. doi: 10.1093/bioinformatics/18.7.908.
8
Removing near-neighbour redundancy from large protein sequence collections.
Bioinformatics. 1998 Jun;14(5):423-9. doi: 10.1093/bioinformatics/14.5.423.
9
A RAPID algorithm for sequence database comparisons: application to the identification of vector contamination in the EMBL databases.
Bioinformatics. 1999 Feb;15(2):111-21. doi: 10.1093/bioinformatics/15.2.111.
10
Automated protein sequence database classification. II. Delineation of domain boundaries from sequence similarities.
Bioinformatics. 1998;14(2):174-87. doi: 10.1093/bioinformatics/14.2.174.