Austin Ryan S, Provart Nicholas J, Cutler Sean R
Department of Cell & Systems Biology, University of Toronto, Toronto, ON, Canada.
BMC Genomics. 2007 Jun 26;8:191. doi: 10.1186/1471-2164-8-191.
The carboxy termini of proteins are frequent sites of biologically important activity, ranging from post-translational modification to protein targeting. Several short peptide motifs that act in protein sorting and depend on proximity to the C-terminus for proper function have already been characterized. Because only a limited number of such motifs have been identified, genome-wide statistical analysis and comparative genomics have the potential to reveal novel peptide signatures that function in a C-terminal-dependent manner. We applied a novel methodology to the prediction of C-terminal-anchored peptide motifs, combining a simple z-statistic with several techniques for improving the signal-to-noise ratio.
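As an illustration of the kind of z-statistic described above, the following minimal Python sketch scores each C-terminal tripeptide against a simple null model. The abstract does not give the exact form of the statistic, so the binomial null model (expected frequency taken as the product of background residue frequencies) and the function name `cterm_tripeptide_zscores` are assumptions made for illustration, not the authors' implementation.

```python
from collections import Counter
from math import sqrt


def cterm_tripeptide_zscores(sequences):
    """Return a z-score per observed C-terminal tripeptide.

    Assumed null model (not taken from the paper): each of the n proteins
    ends in a given tripeptide with probability p, where p is the product
    of the background frequencies of its three residues.
    """
    seqs = [s for s in sequences if len(s) >= 3]
    n = len(seqs)

    # Background residue frequencies pooled over all sequences.
    residue_counts = Counter("".join(seqs))
    total = sum(residue_counts.values())
    freq = {aa: count / total for aa, count in residue_counts.items()}

    # Observed counts of the last three residues of each protein.
    observed = Counter(s[-3:] for s in seqs)

    zscores = {}
    for tri, obs in observed.items():
        p = freq[tri[0]] * freq[tri[1]] * freq[tri[2]]
        expected = n * p
        sd = sqrt(n * p * (1 - p))
        zscores[tri] = (obs - expected) / sd if sd > 0 else 0.0
    return zscores


if __name__ == "__main__":
    toy = ["MAAASKL", "MGGGSKL", "MTTTSKL", "MAGTAAA"]
    for tri, z in sorted(cterm_tripeptide_zscores(toy).items()):
        print(tri, round(z, 2))
```

In this toy input, SKL ends three of the four sequences and is built from rare residues, so it scores well above AAA. A real analysis would replace the composition-based expectation with the sequence randomization models mentioned in the abstract.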
We examined the statistical over-representation of position-specific C-terminal tripeptides in 7 eukaryotic proteomes. Sequence randomization models and simple-sequence masking successfully reduced background noise. Because C-terminal homology among members of large protein families can artificially inflate tripeptide counts and obscure genuine signals, gene-family clustering was performed before the analysis so that tripeptide over-representation was assessed across protein families rather than across all proteins. Finally, comparative genomics was used to identify tripeptides that are significantly over-represented in multiple species. To our knowledge, this approach predicted all C-terminally anchored targeting motifs present in the literature. These include the PTS1 peroxisomal targeting signal (SKL*), the ER-retention signal (K/HDEL*), the ER-retrieval signal for membrane-bound proteins (KKxx*), the prenylation signal (CC*) and the CaaX box prenylation motif. In addition to the high statistical over-representation of these known motifs, a collection of significant tripeptides with a high propensity for biological function is shared between species, among kingdoms and across eukaryotes. Motifs of note include a serine-acidic peptide (DSD*) as well as several lysine-enriched motifs found in nearly all eukaryotic genomes examined.
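The family-level counting and cross-species comparison described above can be sketched in Python as follows. This is a hedged illustration only: the function names, the `family_of` mapping from protein ID to family ID, and the `min_species` threshold are hypothetical, and the abstract does not say how families were clustered or how many species a tripeptide had to appear in.

```python
from collections import defaultdict


def family_level_counts(proteins, family_of):
    """Count each C-terminal tripeptide once per protein family.

    proteins:  dict of protein ID -> amino acid sequence
    family_of: dict of protein ID -> family ID (hypothetical clustering
               output); proteins without an entry count as singletons.
    """
    families_seen = defaultdict(set)
    for pid, seq in proteins.items():
        if len(seq) < 3:
            continue
        family = family_of.get(pid, ("singleton", pid))
        families_seen[seq[-3:]].add(family)
    return {tri: len(fams) for tri, fams in families_seen.items()}


def shared_across_species(significant_by_species, min_species=2):
    """Return tripeptides flagged as significant in at least min_species species."""
    hits = defaultdict(int)
    for tripeptides in significant_by_species.values():
        for tri in set(tripeptides):
            hits[tri] += 1
    return {tri for tri, count in hits.items() if count >= min_species}


if __name__ == "__main__":
    proteins = {"p1": "MASKL", "p2": "MGSKL", "p3": "MTSKL", "p4": "MKDEL"}
    family_of = {"p1": "famA", "p2": "famA", "p3": "famB"}
    print(family_level_counts(proteins, family_of))
    print(shared_across_species({"sp1": {"SKL", "DEL"}, "sp2": {"SKL"}}))
```

Counting families rather than proteins means that two paralogs sharing a C-terminus contribute one observation, not two, which is the inflation the clustering step is meant to remove.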
We have generated a high-confidence representation of eukaryotic motifs anchored at the C-terminus. The high incidence of true positives in our results suggests that several previously unidentified tripeptide patterns are strong candidates for novel peptide motifs that are widely employed in the C-terminal biology of eukaryotes. Our combination of comparative genomics, statistical over-representation and adjustment for protein family homology has generated several hypotheses concerning C-terminal topology as it pertains to sorting and potential protein interaction signals. This approach to background reduction could be extended to motif prediction in the protein interior. A parallel N-terminal analysis is presented as supplementary data.