Defining the fold space of membrane proteins: the CAMPS database.

Martin-Galiano Antonio J, Frishman Dmitrij

Department of Genome Oriented Bioinformatics, Technische Universität München, Freising, Germany.

Proteins. 2006 Sep 1;64(4):906-22. doi: 10.1002/prot.21081.

Recent progress in structure determination techniques has led to a significant growth in the number of known membrane protein structures, and the first structural genomics projects focusing on membrane proteins have been initiated, warranting an investigation of appropriate bioinformatics strategies for optimal structural target selection for these molecules. What determines a membrane protein fold? How many membrane structures need to be solved to provide sufficient structural coverage of the membrane protein sequence space? We present the CAMPS database (Computational Analysis of the Membrane Protein Space) containing almost 45,000 proteins with three or more predicted transmembrane helices (TMH) from 120 bacterial species. This large set of membrane proteins was subjected to single-linkage clustering using only sequence alignments covering at least 40% of the TMH present in a given family. This process yielded 266 sequence clusters with at least 15 members, roughly corresponding to membrane structural folds, sufficiently structurally homogeneous in terms of the variation of TMH number between individual sequences. These clusters were further subdivided into functionally homogeneous subclusters according to the COG (Clusters of Orthologous Groups) system as well as more stringently defined families sharing at least 30% identity. The CAMPS sequence clusters are thus designed to reflect three main levels of interest for structural genomics: fold, function, and modeling distance. We present a library of Hidden Markov Models (HMM) derived from sequence alignments of TMH at these three levels of sequence similarity. Given that 24 out of 266 clusters corresponding to membrane folds already have associated known structures, we estimate that 242 additional new structures, one for each remaining cluster, would provide structural coverage at the fold level of roughly 70% of prokaryotic membrane proteins belonging to the currently most populated families.
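
The clustering step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that pairwise alignments are already available as hypothetical `(id_a, id_b, tmh_coverage)` tuples, where `tmh_coverage` is the fraction of transmembrane helices covered by the alignment; how CAMPS actually computes this coverage is not specified in the abstract. Only the 40% coverage threshold and the minimum cluster size of 15 come from the text. Single linkage is implemented here with a union-find structure, which is one common way to do it.

```python
from collections import defaultdict

# Thresholds taken from the abstract; everything else is illustrative.
MIN_TMH_COVERAGE = 0.40
MIN_CLUSTER_SIZE = 15


def single_linkage_clusters(proteins, alignments,
                            min_coverage=MIN_TMH_COVERAGE,
                            min_size=MIN_CLUSTER_SIZE):
    """Group proteins by single linkage.

    proteins:   iterable of protein identifiers (hypothetical input format)
    alignments: iterable of (id_a, id_b, tmh_coverage) tuples (assumed format)
    Returns a list of sets, one per cluster with at least `min_size` members.
    """
    parent = {p: p for p in proteins}

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Single linkage: one qualifying alignment is enough to merge two groups.
    for a, b, coverage in alignments:
        if coverage >= min_coverage:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    groups = defaultdict(set)
    for p in parent:
        groups[find(p)].add(p)
    return [members for members in groups.values() if len(members) >= min_size]


# Toy data (invented for illustration only).
toy_proteins = ["p1", "p2", "p3", "p4", "p5"]
toy_alignments = [
    ("p1", "p2", 0.50),
    ("p2", "p3", 0.45),
    ("p3", "p4", 0.20),  # below the 40% threshold, so no link
    ("p4", "p5", 0.90),
]
toy_clusters = single_linkage_clusters(toy_proteins, toy_alignments, min_size=2)

# Figures from the abstract: clusters still lacking a known structure.
remaining_clusters = 266 - 24
```

On the toy data, `p1`, `p2`, and `p3` end up in one cluster even though `p1` and `p3` are never aligned directly; this chaining is characteristic of single linkage. The last line reproduces the abstract's count of 242 clusters still lacking a known structure.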
