Koonin E V, Tatusov R L, Rudd K E
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA.
Proc Natl Acad Sci U S A. 1995 Dec 5;92(25):11921-5. doi: 10.1073/pnas.92.25.11921.
A computer analysis of 2328 protein sequences, comprising about 60% of the Escherichia coli gene products, was performed using methods for database screening with individual sequences and alignment blocks. A high fraction of E. coli proteins (86%) shows significant sequence similarity to other proteins in current databases; about 70% show conservation at least at the level of distantly related bacteria, and about 40% contain ancient conserved regions (ACRs) shared with eukaryotic or archaeal proteins. For more than 90% of the E. coli proteins, functional information, sequence similarity, or both are available. Forty-six percent of the E. coli proteins belong to 299 clusters of paralogs (intraspecies homologs) defined on the basis of pairwise similarity. Another 10% could be included in 70 superclusters using motif detection methods. The majority of the clusters contain only two to four members. In contrast, nearly 25% of all E. coli proteins belong to the four largest superclusters: permeases, ATPases and GTPases with the conserved "Walker-type" motif, helix-turn-helix regulatory proteins, and NAD(FAD)-binding proteins. We conclude that bacterial protein sequences are generally highly conserved in evolution, with about 50% of all ACR-containing protein families represented among the E. coli gene products. With current sequence databases and screening methods, computer analysis yields useful information on the functions and evolutionary relationships of the vast majority of genes in a bacterial genome. Sequence similarity with E. coli proteins allows the prediction of functions for a number of important eukaryotic genes, including several whose products are implicated in human diseases.
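As an illustration of how clusters of paralogs can be derived from pairwise similarity, the following minimal Python sketch groups proteins by single-linkage clustering with a union-find structure. The abstract does not state which clustering procedure was used, so single linkage is an assumption here, and the function name `cluster_paralogs`, the protein labels, and the similarity pairs are all hypothetical toy values rather than data from the study.

```python
from collections import defaultdict


def cluster_paralogs(proteins, similar_pairs):
    """Group proteins into clusters by single linkage (assumed method).

    proteins: iterable of protein identifiers (hypothetical labels).
    similar_pairs: iterable of (a, b) pairs judged significantly similar.
    Returns a list of sets, each holding at least two members.
    """
    parent = {p: p for p in proteins}

    def find(p):
        # Walk to the root, halving the path along the way.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    groups = defaultdict(set)
    for p in proteins:
        groups[find(p)].add(p)

    # Proteins without any similar partner are not paralog clusters.
    return [members for members in groups.values() if len(members) >= 2]


# Toy input: six hypothetical proteins and three similarity hits.
proteins = ["A", "B", "C", "D", "E", "F"]
pairs = [("A", "B"), ("B", "C"), ("D", "E")]

clusters = cluster_paralogs(proteins, pairs)
fraction = sum(len(c) for c in clusters) / len(proteins)

print(sorted(sorted(c) for c in clusters))
print(f"fraction of proteins in clusters: {fraction:.2f}")
```

In this toy input, A and C land in the same cluster through their shared similarity to B even though they are not directly paired, which is the defining behavior of single linkage. The fraction computed at the end is the same kind of quantity as the 46% reported for the E. coli set.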