Removing near-neighbour redundancy from large protein sequence collections.

Author information

Holm L, Sander C

Affiliation

EMBL-EBI, Cambridge CB10 1SD, UK.

Publication information

Bioinformatics. 1998 Jun;14(5):423-9. doi: 10.1093/bioinformatics/14.5.423.

PMID:9682055

Abstract

MOTIVATION

To maximize the chances of biological discovery, homology searching must use an up-to-date collection of sequences. However, the available sequence databases are growing rapidly and are partially redundant in content. This leads to increasing strain on CPU resources and decreasing density of first-hand annotation.

RESULTS

These problems are addressed by clustering closely similar sequences to yield a covering of sequence space by a representative subset of sequences. No pair of sequences in the representative set has >90% mutual sequence identity. The representative set is derived by an exhaustive search for close similarities in the sequence database in which the need for explicit sequence alignment is significantly reduced by applying deca- and pentapeptide composition filters. The algorithm was applied to the union of the Swissprot, Swissnew, Trembl, Tremblnew, Genbank, PIR, Wormpep and PDB databases. The all-against-all comparison required to generate a representative set at 90% sequence identity was accomplished in 2 days CPU time, and the removal of fragments and close similarities yielded a size reduction of 46%, from 260 000 unique sequences to 140 000 representative sequences. The practical implications are (i) faster homology searches using, for example, Fasta or Blast, and (ii) unified annotation for all sequences clustered around a representative. As tens of thousands of sequence searches are performed daily world-wide, appropriate use of the non-redundant database can lead to major savings in computer resources, without loss of efficacy.
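The filter-then-align idea described above can be illustrated with a minimal sketch. This is not the nrdb90.pl implementation: it assumes a single pentapeptide (5-mer) filter instead of the combined deca- and pentapeptide filters, a greedy longest-first pass, ungapped comparison, and identity measured over the shorter sequence. All function names and parameters are hypothetical.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Multiset of all overlapping k-mers in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def best_ungapped_matches(short, long_seq):
    """Largest number of identical residues over all ungapped
    placements of `short` inside `long_seq`."""
    best = 0
    for offset in range(len(long_seq) - len(short) + 1):
        matches = sum(1 for x, y in zip(short, long_seq[offset:]) if x == y)
        best = max(best, matches)
    return best


def select_representatives(seqs, identity_pct=90, k=5):
    """Greedy sketch: keep a sequence only if it is not more than
    identity_pct percent identical to an already kept, longer one."""
    reps = []  # list of (sequence, k-mer multiset)
    # Remove exact duplicates, then process longest first so that
    # fragments are absorbed by the full-length sequence.
    for seq in sorted(dict.fromkeys(seqs), key=len, reverse=True):
        counts = kmer_counts(seq, k)
        n_kmers = len(seq) - k + 1
        # Identity > identity_pct allows at most this many mismatches
        # (integer arithmetic avoids floating-point rounding).
        max_mismatches = len(seq) - (identity_pct * len(seq)) // 100 - 1
        # Each mismatch destroys at most k k-mers, so a true hit must
        # share at least this many k-mers with the representative.
        min_shared = n_kmers - k * max_mismatches
        redundant = False
        for rep, rep_counts in reps:
            shared = sum((counts & rep_counts).values())
            if shared < min_shared:
                continue  # composition filter: skip explicit comparison
            matches = best_ungapped_matches(seq, rep)
            if matches * 100 > identity_pct * len(seq):
                redundant = True
                break
        if not redundant:
            reps.append((seq, counts))
    return [seq for seq, _ in reps]


# Toy input (invented sequences, for illustration only).
full_length = "MKTAYIAKQRQISFVKSHFS"
point_variant = "MKTAYIAKQRQISFVKSHFA"   # 19/20 identical to full_length
fragment = "AKQRQISFVK"                  # exact substring of full_length
unrelated = "GGSGLVPRGSHMWWDDEE"

representatives = select_representatives(
    [full_length, point_variant, fragment, unrelated]
)
print(representatives)
```

In this toy run the point variant and the fragment are absorbed by the full-length sequence, while the unrelated sequence is rejected by the k-mer filter without any residue-by-residue comparison and is kept as its own representative.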

AVAILABILITY

A regularly updated non-redundant protein sequence database (nrdb90), a server for homology searches against nrdb90, and a Perl script (nrdb90.pl) implementing the algorithm are available for academic use from http://www.embl-ebi.ac.uk/holm/nrdb90.

CONTACT

holm@embl-ebi.ac.uk
