Department of Molecular Biology, Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, United States of America.
PLoS One. 2010 Dec 29;5(12):e14444. doi: 10.1371/journal.pone.0014444.
The increasing ability to generate large-scale, quantitative proteomic data has brought with it the challenge of analyzing such data to discover the sequence elements that underlie systems-level protein behavior. Here we show that short, linear protein motifs can be efficiently recovered from proteome-scale datasets such as subcellular localization, molecular function, half-life, and protein abundance data using an information-theoretic approach. Using this approach, we have identified many known protein motifs, such as phosphorylation sites and localization signals, and discovered a large number of candidate elements. We estimate that ~80% of these are novel predictions in that they do not match a known motif in both sequence and biological context, suggesting that post-translational regulation of protein behavior is still largely unexplored. These predicted motifs, many of which display preferential association with specific biological pathways and non-random positioning in the linear protein sequence, provide focused hypotheses for experimental validation.
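The abstract does not specify the scoring function, so the following is only a minimal sketch of one common information-theoretic measure that fits this setting: the mutual information between the presence of a candidate motif and a discrete protein-level label (for example, a localization class). The function names, the regular-expression representation of a motif, and the toy sequences and labels are all assumptions made for illustration and are not taken from the paper.

```python
import math
import re
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information, in bits, between two equal-length discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))   # counts of each (x, y) pair
    px = Counter(xs)               # marginal counts of x
    py = Counter(ys)               # marginal counts of y
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi


def motif_information(pattern, sequences, labels):
    """Score a candidate motif (hypothetical regex form) against protein labels."""
    present = [bool(re.search(pattern, seq)) for seq in sequences]
    return mutual_information(present, labels)


# Toy data (invented): a C-terminal "KDEL" pattern and two made-up classes.
toy_sequences = ["AAKDEL", "CCKDEL", "AAAAAA", "CCCCCC"]
toy_labels = ["ER", "ER", "cytosol", "cytosol"]
score = motif_information(r"KDEL$", toy_sequences, toy_labels)
```

In a sketch like this, a motif whose presence tracks the label closely scores high, while a motif distributed independently of the label scores near zero. Continuous measurements such as half-life or abundance would first need to be discretized into bins, and any real analysis would need a significance test against randomized labels.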