Plewczyński Dariusz, Tkacz Adrian, Godzik Adam, Rychlewski Leszek
BioInfoBank Institute, Limanowskiego 24A/16, 60-744 Poznań, Poland.
Cell Mol Biol Lett. 2005;10(1):73-89.
We describe a bioinformatics tool that predicts the position of phosphorylation sites in proteins from sequence information alone. The method is based on support vector machine (SVM) statistical learning theory. Statistical models for phosphorylation by various types of kinases are built from a dataset of short (9-amino-acid) sequence fragments. The fragments are dissected around post-translationally modified sites of proteins that are in the current release of the Swiss-Prot database and that were experimentally confirmed to be phosphorylated by any kinase. We represent them as vectors in a multidimensional abstract space of short sequence fragments. The prediction method is as follows. First, a given query protein sequence is dissected into overlapping short segments. All the fragments are then projected into the multidimensional space of sequence fragments via a collection of different representations. These points are then classified by pre-built statistical models (SVMs with linear, polynomial and radial kernel functions) as either phosphorylated or inactive. The resulting list of plausible sites for phosphorylation by various types of kinases in the query protein is returned to the user. The efficiency of the method for each type of phosphorylation is estimated using leave-one-out tests and presented here. Model sensitivity can exceed 70%, depending on the type of kinase. The additional information from profile representations of short sequence fragments helps to achieve higher accuracy for some phosphorylation types. Further development of an automatic phosphorylation site annotation predictor based on our algorithm should yield a significant improvement when statistical algorithms are used to quantify the results.
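The dissection and projection steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that windows are centred on serine, threonine or tyrosine residues, that sequence ends are padded with a hypothetical `X` symbol, and that a simple one-hot (binary) encoding stands in for one of the "different representations"; the names `dissect` and `one_hot` are illustrative only.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
WINDOW = 9                            # fragment length stated in the abstract
HALF = WINDOW // 2
PAD = "X"                             # assumption: padding symbol for sequence ends


def dissect(sequence, targets="STY"):
    """Yield (1-based position, fragment) for every 9-residue window
    centred on a candidate residue (assumption: S, T or Y)."""
    padded = PAD * HALF + sequence + PAD * HALF
    for i, residue in enumerate(sequence):
        if residue in targets:
            yield i + 1, padded[i:i + WINDOW]


def one_hot(fragment):
    """Encode a fragment as a 9 x 20 = 180-dimensional binary vector.
    Padding symbols map to all-zero blocks."""
    vector = []
    for residue in fragment:
        vector.extend(1.0 if residue == aa else 0.0 for aa in AMINO_ACIDS)
    return vector


if __name__ == "__main__":
    for position, fragment in dissect("MKRSPAAT"):
        print(position, fragment, len(one_hot(fragment)))
```

Vectors produced this way could then be passed to any SVM implementation, for example scikit-learn's `SVC` with `kernel="linear"`, `"poly"` or `"rbf"`, trained on fragments around experimentally confirmed sites. That library choice is also an assumption and is not stated in the abstract.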