Durston Kirk K, Chiu David KY, Wong Andrew KC, Li Gary CL
School of Computer Science, University of Guelph, 50 Stone Road East, Guelph, ON, N1G 2W1, Canada.
EURASIP J Bioinform Syst Biol. 2012 Jul 13;2012(1):8. doi: 10.1186/1687-4153-2012-8.
Much progress has been made in understanding the 3D structure of proteins using methods such as NMR and X-ray crystallography. The resulting 3D structures are extremely informative, but do not always reveal which sites and residues within the structure are of special importance. Recently, there have been indications that multiple-residue, sub-domain structural relationships within the larger 3D consensus structure of a protein can be inferred from analysis of the multiple sequence alignment data of a protein family. These clusters of associated, mutually dependent sites are used to indicate hierarchical inter-residue relationships within the 3D structure. To reveal the patterns of association among individual amino acids or sub-domain components within the structure, we apply a k-modes attribute (aligned site) clustering algorithm to the ubiquitin and transthyretin families to discover associations among groups of sites within the multiple sequence alignment. We then examine what these associations imply for the 3D structure of these two protein families.
The k-modes site clustering algorithm we developed maximizes intra-group interdependencies based on a normalized mutual information measure. The clusters formed correspond to sub-structural components or to binding and interface locations. Applying this data-directed method to the multiple sequence alignments of the ubiquitin and transthyretin protein families as a test bed, we located numerous interesting associations of interdependent sites. These clusters were then arranged into cluster tree diagrams, which revealed four structural sub-domains within the single-domain structure of ubiquitin and a single large sub-domain within transthyretin associated with the interface among transthyretin monomers. In addition, several clusters of mutually interdependent sites were discovered for each protein family, each of which appears to play an important role in the molecular structure and/or function.
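The idea of clustering aligned sites around "modes" by normalized mutual information can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the normalization is I(X;Y) / H(X,Y), that a cluster's mode is the site with the largest summed normalized mutual information to the other sites in its cluster, and that the initial modes are supplied by the caller. The function names (`normalized_mi`, `cluster_mode`, `k_modes_sites`) and the toy alignment are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(counts, n):
    """Shannon entropy (bits) of a distribution given as a Counter over n observations."""
    return -sum((c / n) * log2(c / n) for c in counts.values())


def normalized_mi(col_x, col_y):
    """Assumed normalization: I(X;Y) / H(X,Y) for two equal-length alignment columns."""
    n = len(col_x)
    h_x = entropy(Counter(col_x), n)
    h_y = entropy(Counter(col_y), n)
    h_xy = entropy(Counter(zip(col_x, col_y)), n)
    if h_xy == 0:  # both columns fully conserved: no information to share
        return 0.0
    return max(0.0, h_x + h_y - h_xy) / h_xy


def alignment_columns(alignment):
    """Turn a list of equal-length aligned sequences into a list of site columns."""
    return ["".join(seq[i] for seq in alignment) for i in range(len(alignment[0]))]


def cluster_mode(cols, members):
    """Site in `members` with the largest summed normalized MI to the other members."""
    return max(
        members,
        key=lambda i: sum(normalized_mi(cols[i], cols[j]) for j in members if j != i),
    )


def k_modes_sites(cols, modes, max_iter=20):
    """Assign each site to its most interdependent mode, then update the modes."""
    modes = list(modes)
    clusters = {}
    for _ in range(max_iter):
        clusters = {m: [m] for m in modes}
        for i in range(len(cols)):
            if i in clusters:
                continue  # a mode already belongs to its own cluster
            best = max(modes, key=lambda m: normalized_mi(cols[i], cols[m]))
            clusters[best].append(i)
        new_modes = [cluster_mode(cols, members) for members in clusters.values()]
        if sorted(new_modes) == sorted(modes):
            break  # modes are stable, so the assignment has converged
        modes = new_modes
    return clusters


# Toy alignment (hypothetical data): sites 0 and 1 co-vary perfectly.
toy_alignment = ["ACGA", "ACGC", "TGGA", "TGGC"]
toy_cols = alignment_columns(toy_alignment)
toy_clusters = k_modes_sites(toy_cols, modes=[0, 3])
```

In this sketch a fully conserved site carries zero entropy and therefore zero interdependency with every other site, so it simply falls into the first cluster. A real analysis would also need to decide how to treat gaps and how to choose the number of clusters, which the sketch leaves to the caller.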
Our results demonstrate that the method presented here, a k-modes site clustering algorithm based on evaluating the interdependency among sites in a sequence alignment of homologous proteins, can provide significant insights into the complex, hierarchical inter-residue structural relationships within the 3D structure of a protein family.