Garg Aarti, Raghava Gajendra P S
Institute of Microbial Technology, Sector 39A, Chandigarh, India.
In Silico Biol. 2008;8(2):129-40.
Most prediction methods for secretory proteins require the correct N-terminal end of the preprotein for correct classification. Because large-scale genome sequencing projects sometimes assign the 5'-end of genes incorrectly, many proteins are annotated without the correct N-terminus, which leads to incorrect predictions. In this study, we made a systematic attempt to predict secretory proteins irrespective of the presence or absence of N-terminal signal peptides (classical and non-classical secreted proteins, respectively) using two machine-learning techniques: artificial neural networks (ANN) and support vector machines (SVM). We trained and tested our methods on a dataset of 3321 secretory and 3654 non-secretory mammalian proteins using five-fold cross-validation. First, ANN-based modules using 33 physico-chemical properties, amino acid composition and dipeptide composition achieved accuracies of 73.1%, 76.1% and 77.1%, respectively. Similarly, SVM-based modules using 33 physico-chemical properties, amino acid composition and dipeptide composition achieved accuracies of 77.4%, 79.4% and 79.9%, respectively. In addition, BLAST and PSI-BLAST modules that predict secretory proteins by similarity search achieved accuracies of 23.4% and 26.9%, respectively. Finally, we developed a hybrid approach that integrates the amino acid and dipeptide composition based SVM modules with the PSI-BLAST module; it increased the accuracy to 83.2%, which is significantly better than any individual module. The hybrid module also achieved a sensitivity of 60.4% with only 5% false positive predictions. Based on this study, we developed a web server, SRTpred, for predicting classical and non-classical secreted proteins from the whole sequence of mammalian proteins; it is available from http://www.imtech.res.in/raghava/srtpred/.
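The two sequence-derived feature sets named in the abstract, amino acid composition (20 values) and dipeptide composition (400 values), can be illustrated with a minimal sketch. The abstract does not give the exact encoding used for SRTpred, so the percentage normalization, the handling of non-standard residues, and the function names below are assumptions made for illustration only.

```python
from itertools import product

# The 20 standard amino acids, in a fixed order so feature vectors are comparable.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered residue pairs, e.g. "AA", "AC", ..., "YY".
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def amino_acid_composition(sequence):
    """Return a 20-value vector: percentage of each standard residue.

    Assumption: non-standard symbols (e.g. X, B, Z) are ignored.
    """
    residues = [r for r in sequence.upper() if r in AMINO_ACIDS]
    total = len(residues)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [100.0 * residues.count(aa) / total for aa in AMINO_ACIDS]


def dipeptide_composition(sequence):
    """Return a 400-value vector: percentage of each adjacent residue pair.

    Assumption: pairs containing a non-standard symbol are skipped, and the
    denominator is the number of pairs actually counted.
    """
    seq = sequence.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    valid_pairs = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:
            counts[pair] += 1
            valid_pairs += 1
    if valid_pairs == 0:
        return [0.0] * len(DIPEPTIDES)
    return [100.0 * counts[dp] / valid_pairs for dp in DIPEPTIDES]


if __name__ == "__main__":
    # Hypothetical toy sequence, not taken from the study's dataset.
    example = "MKTLLLTLVVVTIVCLDLGYT"
    print(len(amino_acid_composition(example)))  # length of the composition vector
    print(len(dipeptide_composition(example)))   # length of the dipeptide vector
```

Both vectors are computed from the whole sequence rather than from the N-terminal region alone, which is consistent with the abstract's aim of predicting secretion even when the N-terminus is annotated incorrectly. Vectors of this kind could then be passed to any SVM or neural network implementation.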