Arai Masafumi, Fukushi Takafumi, Satake Masanobu, Shimizu Toshio
Department of Electronic and Information System Engineering, Faculty of Science and Technology, Hirosaki University, Japan.
Comput Biol Chem. 2005 Oct;29(5):379-87. doi: 10.1016/j.compbiolchem.2005.08.004. Epub 2005 Oct 6.
We performed a proteome-wide survey of the domain architectures of single-spanning transmembrane (TM) proteins (single-spannings) from 87 sequenced prokaryotic (bacterial and archaeal) genomes by assigning Pfam domains to their N-tail and C-tail loops. Of 14,625 single-spannings, 3,516 sequences have at least one assigned domain, 7,850 have none, and the remaining 3,259 have only less reliable assignments. Among the domain-assigned sequences, 3,116 carry at most two domains and the other 400 carry more than two. The assigned domains are distributed over 651 Pfam families, which account for 11.4% of all Pfam-A families. Most of these 651 families originate from soluble proteins; only 21 are unique to TM proteins. The occurrence frequency of the individual domain families follows a power law: 264 families occur only once, 106 occur exactly twice, and only 39 occur more than 30 times. The great majority of sequences with one or two domains have type II topology, with the domains located on the C-tail loop. In contrast, the N-tail loop in this topology seldom carries domains. Importantly, assigned domains are always found on tail loops longer than 60 residues, even when the domain itself is shorter than 30 residues. As many as 5,800 sequences have at least one long tail yet no assigned domain; we expect these tails to conceal no fewer than 1,000 novel, as yet unknown domain families. We also investigated the domain arrangement preference and the domain family combination patterns in 'singlets' (single-spannings with one assigned domain) and 'doublets' (single-spannings with two assigned domains).
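The counts reported in the abstract can be cross-checked with simple arithmetic. The following is a minimal sketch, not part of the original study: the dictionary keys and the function name `summarize` are hypothetical, and the only inputs are the figures quoted above. Because the 11.4% share is a rounded value, the Pfam-A total derived from it is only approximate.

```python
# Figures copied from the abstract; key names are illustrative assumptions.
REPORTED = {
    "total": 14_625,               # all single-spannings surveyed
    "assigned": 3_516,             # at least one domain assigned
    "unassigned": 7_850,           # no domain assigned
    "less_reliable": 3_259,        # less reliable assignment
    "at_most_two_domains": 3_116,
    "more_than_two_domains": 400,
    "families": 651,               # Pfam families observed
    "fraction_of_pfam_a": 0.114,   # 11.4% of Pfam-A (rounded in the abstract)
    "families_once": 264,
    "families_twice": 106,
    "families_over_30": 39,
}


def summarize(r=REPORTED):
    """Check that the reported subsets add up and derive two implied values."""
    partition_ok = (
        r["assigned"] + r["unassigned"] + r["less_reliable"] == r["total"]
    )
    assigned_ok = (
        r["at_most_two_domains"] + r["more_than_two_domains"] == r["assigned"]
    )
    # Approximate size of Pfam-A implied by "651 families = 11.4%".
    implied_pfam_a_total = r["families"] / r["fraction_of_pfam_a"]
    # Families not covered by the three quoted bins occur 3 to 30 times.
    families_3_to_30 = r["families"] - (
        r["families_once"] + r["families_twice"] + r["families_over_30"]
    )
    return {
        "partition_ok": partition_ok,
        "assigned_ok": assigned_ok,
        "implied_pfam_a_total": implied_pfam_a_total,
        "families_3_to_30": families_3_to_30,
    }


if __name__ == "__main__":
    for key, value in summarize().items():
        print(f"{key}: {value}")
```

Both partitions add up exactly, the three quoted frequency bins leave 242 families that occur between 3 and 30 times, and the 11.4% share implies roughly 5,700 Pfam-A families in total.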