Jiang Yang, Li Bi-Qing, Zhang Yuchao, Feng Yuan-Ming, Gao Yu-Fei, Zhang Ning, Cai Yu-Dong
Department of Surgery, China-Japan Union Hospital of Jilin University, Changchun, P. R. China.
PLoS One. 2013 Jun 21;8(6):e66678. doi: 10.1371/journal.pone.0066678. Print 2013.
Most pyruvoyl-dependent proteins observed in prokaryotes and eukaryotes are critical regulatory enzymes and primary targets of inhibitors for anti-cancer and anti-parasitic therapy. These proteins undergo an autocatalytic, intramolecular self-cleavage reaction in which a covalently bound pyruvoyl group is generated on a conserved serine residue. The modified serine sites have traditionally been detected by experimental approaches, which are often labor-intensive and time-consuming. In this study, we attempted to predict such serine sites computationally using feature selection based on a Random Forest. Because only a small number of experimentally verified pyruvoyl-modified proteins are collected in the current version of the protein database, we used only a small dataset. After removing proteins with sequence identities >60%, we obtained a non-redundant dataset of only 46 proteins, each with one pyruvoyl serine site. Several types of features were considered, including PSSM conservation scores, disorder, secondary structure, solvent accessibility, amino acid factors and amino acid occurrence frequencies. The method performed well on our dataset: the best results were 100.00% accuracy and an MCC of 1.0000 on the training dataset, and 93.75% accuracy and an MCC of 0.8441 on the testing dataset. The optimal feature set contained 9 features. Analysis of the optimal feature set indicated that some specific features play important roles in determining the pyruvoyl serine sites, consistent with several results of earlier experimental studies. These selected features may shed light on the mechanism of the post-translational self-maturation process and provide guidelines for experimental validation. As more pyruvoyl-modified proteins are found, future work should evaluate the method on larger datasets. The prediction software can be downloaded from http://www.nkbiox.com/sub/pyrupred/index.html.
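The two metrics reported above, accuracy and the Matthews correlation coefficient (MCC), can both be computed from the counts of a binary confusion matrix. The following is a minimal sketch rather than the authors' code: the function names and the example counts are hypothetical and are not taken from the study.

```python
import math


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all predictions that are correct."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient for a binary classifier.

    Returns 0.0 when the denominator is zero (a common convention
    when any row or column of the confusion matrix is empty).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical counts for illustration only (not the paper's data).
example = dict(tp=8, tn=9, fp=1, fn=2)
print(f"accuracy = {accuracy(**example):.4f}")
print(f"MCC      = {mcc(**example):.4f}")
```

A classifier with no errors gives an accuracy of 1.0 and an MCC of 1.0, which corresponds to the training-set figures quoted above. Unlike accuracy, the MCC uses all four cells of the confusion matrix, so it stays informative when modified serines are far rarer than unmodified ones.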