Computer Department, Jingdezhen Ceramic Institute, Jingdezhen, 333403, China.
Department of Computer Science and Bond Life Science Center, University of Missouri, Columbia, MO, USA.
Mol Inform. 2017 May;36(5-6). doi: 10.1002/minf.201600010. Epub 2016 May 12.
Protein phosphorylation plays a critical role in the human body by altering the structural conformation of a protein, thereby activating, deactivating, or functionally modifying it. Given an uncharacterized protein sequence, can we predict whether it can be phosphorylated? This is a meaningful problem for both basic research and drug development. Unfortunately, to the best of our knowledge, no high-throughput bioinformatics tool has so far been developed to address this basic but important problem, owing to its extreme complexity and the lack of sufficient training data. Here we propose a predictor called iPhos-PseEvo, built by (1) incorporating protein sequence evolutionary information into the general pseudo amino acid composition (PseAAC) via grey system theory, (2) balancing the skewed training datasets with an asymmetric bootstrap approach, and (3) constructing an ensemble predictor by fusing an array of individual random forest classifiers through a voting system. Rigorous jackknife tests indicate that iPhos-PseEvo achieves very promising success rates even for such a difficult problem. A user-friendly web server for iPhos-PseEvo has been established at http://www.jci-bioinfo.cn/iPhos-PseEvo, through which users can easily obtain their desired results without working through the complicated mathematical equations involved. It has not escaped our notice that the formulation and approach presented here can also be used to analyze many other problems in protein science.
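Steps (2) and (3) can be illustrated with a minimal sketch. The abstract does not specify the exact asymmetric bootstrap procedure or the voting rule, so the code below assumes one common reading: each ensemble member is trained on a class-balanced resample drawn with replacement, and the final label is a simple majority vote. To stay dependency-free, a nearest-centroid classifier stands in for the random forest; it is not the authors' base learner, and all function names are hypothetical.

```python
import random
from collections import Counter


def balanced_bootstrap(positives, negatives, rng):
    """Draw one class-balanced training set by sampling with replacement.

    Assumption: both classes are resampled down to the minority-class size.
    """
    n = min(len(positives), len(negatives))
    sample = [(x, 1) for x in rng.choices(positives, k=n)]
    sample += [(x, 0) for x in rng.choices(negatives, k=n)]
    return sample


def train_centroid(sample):
    """Stand-in base learner (NOT a random forest): one mean vector per class."""
    centroids = {}
    for label in (0, 1):
        rows = [x for x, y in sample if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict_centroid(centroids, x):
    """Return the label whose centroid is closest to x (squared Euclidean)."""
    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: dist2(centroids[label]))


def train_ensemble(positives, negatives, n_members=11, seed=0):
    """Train each member on its own balanced bootstrap sample."""
    rng = random.Random(seed)
    return [
        train_centroid(balanced_bootstrap(positives, negatives, rng))
        for _ in range(n_members)
    ]


def predict_ensemble(ensemble, x):
    """Fuse the members' predictions by majority vote."""
    votes = Counter(predict_centroid(member, x) for member in ensemble)
    return votes.most_common(1)[0][0]


# Toy, made-up feature vectors with a 1:4 class skew (not real PseAAC data).
toy_positives = [[1.0, 1.0], [0.9, 1.1], [1.1, 0.9]]
toy_negatives = [[0.01 * i, -0.01 * i] for i in range(12)]
toy_ensemble = train_ensemble(toy_positives, toy_negatives)
```

Using an odd number of members avoids tied votes in the two-class case. In practice each member would be a random forest trained on PseAAC feature vectors rather than the toy vectors used here.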