Gana Rajaram, Rao Shruti, Huang Hongzhan, Wu Cathy, Vasudevan Sona
Department of Biostatistics and Bioinformatics, Georgetown University Medical Center, Washington, DC 20007, USA.
BMC Struct Biol. 2013 Apr 25;13:6. doi: 10.1186/1472-6807-13-6.
The post-genomic era poses several challenges. The biggest is identifying the biochemical function of the protein sequences and structures produced by genomic initiatives. Most sequences lack a characterized function and are annotated as hypothetical or uncharacterized. Homology-based methods are useful and work well for sequences sharing more than 50% identity, but they fail for sequences in the twilight zone (<30% sequence identity). Where sequence methods fail, structural approaches are often used, on the premise that structure preserves function over longer evolutionary time-frames than sequence alone. It is now clear that no single method can be used successfully for functional inference. Given the growing need for functional assignments, we describe here a systematic new approach, designated ligand-centric, which is based primarily on analysis of ligand-bound and unbound structures in the PDB. We present the results of applying this approach to S-adenosyl-L-methionine (SAM) binding proteins.
Our analysis included 1,224 structures that belong to 172 unique families of the Protein Information Resource Superfamily system. Our ligand-centric approach was divided into four levels: residue, protein/domain, ligand, and family levels. The residue level included the identification of conserved binding site residues based on structure-guided sequence alignments of representative members of a family, and the identification of conserved structural motifs. The protein/domain level included structural classification of proteins, Pfam domains, domain architectures, and protein topologies. The ligand level included ligand conformations, ribose sugar puckering, and the identification of conserved ligand-atom interactions. The family level included phylogenetic analysis.
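As an illustration of the residue-level step only, the following is a minimal sketch of how conserved positions might be flagged once a structure-guided alignment of family representatives is available. It is not the authors' implementation: the function name `conserved_columns`, the `min_fraction` threshold, and the toy sequences are all assumptions made for this example, and the alignment itself is taken as given.

```python
from collections import Counter


def conserved_columns(aligned_seqs, min_fraction=0.9, gap="-"):
    """Return (column, residue, fraction) for alignment columns where one
    residue occurs in at least `min_fraction` of all sequences.

    Hypothetical helper: `aligned_seqs` is a list of equal-length strings
    taken from an existing alignment. Gaps count against conservation
    because the fraction is computed over all sequences.
    """
    if not aligned_seqs:
        return []
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("all aligned sequences must have the same length")

    hits = []
    for col in range(length):
        counts = Counter(seq[col] for seq in aligned_seqs)
        counts.pop(gap, None)  # a gap is never reported as the conserved residue
        if not counts:
            continue  # column contains only gaps
        residue, n = counts.most_common(1)[0]
        fraction = n / len(aligned_seqs)
        if fraction >= min_fraction:
            hits.append((col, residue, fraction))
    return hits


# Toy alignment, invented for illustration (not real family members).
toy_alignment = [
    "VLDLGCGTG",
    "VLDVGCGSG",
    "ILDLGSGTG",
    "VLEIG-GTG",
]

for col, residue, fraction in conserved_columns(toy_alignment, min_fraction=1.0):
    print(f"column {col}: {residue} ({fraction:.0%})")
```

A stricter or looser `min_fraction` changes which positions are reported; in practice, candidate positions would still need to be checked against the bound ligand in the structure before being treated as binding-site residues.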
We found that SAM bound to a total of 18 different fold types (I-XVIII). We identified 4 new fold types and 11 additional topological arrangements of strands within the well-studied Rossmann fold Methyltransferases (MTases). This extends the existing structural classification of SAM binding proteins. A striking correlation between fold type and the conformation of the bound SAM (classified as types) was found across the 18 fold types. Several site-specific rules were created for the assignment of functional residues to families and proteins that do not have a bound SAM or a solved structure.