Protein family classification based on searching a database of blocks.

Authors

Henikoff S, Henikoff J G

Affiliation

Howard Hughes Medical Institute, Fred Hutchinson Cancer Research Center, Seattle, Washington 98104.

Publication information

Genomics. 1994 Jan 1;19(1):97-107. doi: 10.1006/geno.1994.1018.

DOI:10.1006/geno.1994.1018

PMID:8188249

Abstract

The most highly conserved regions of proteins can be represented as "blocks" of locally aligned sequence segments. Previously, an automated system was introduced to generate a database of blocks that is searched for local similarities using a sequence query. Here, we describe a method for searching this database that can also reveal significant global similarities. Local and global alignments are scored independently, so they can be used in concert to infer homology. A set of 7082 diverse sequences not represented in the database provided queries for testing this approach. The resulting distributions of scores led to guidelines for interpretation of search data and to the classification of 289 uncatalogued sequences into known groups. Thirty-eight of these relationships appear to be new discoveries. We also show how searching a database of blocks can be used to detect repeated domains and to find distinct cross-family relationships that were missed in searches of sequence databases.
