School of Medicine and Health Sciences, Universidad del Rosario, Carrera 24 No. 63C-69, Bogotá DC, Colombia.
BMC Bioinformatics. 2011 Jan 14;12:21. doi: 10.1186/1471-2105-12-21.
Most predictive methods currently available for the identification of protein secretion mechanisms have focused on classically secreted proteins. In fact, only two methods have been reported for predicting non-classically secreted proteins of Gram-positive bacteria. This study describes the implementation of a sequence-based classifier, denoted as NClassG+, for identifying non-classically secreted Gram-positive bacterial proteins.
Several feature-based classifiers were trained using different sequence transformation vectors (frequencies, dipeptides, physicochemical factors and PSSM) and Support Vector Machines (SVMs) with linear, polynomial and Gaussian kernel functions. Nested k-fold cross-validation (CV) was applied to select the best models, using the inner CV loop to tune the model parameters and the outer CV loop to estimate the error. The model parameters, the kernel functions and all possible combinations of feature vectors were optimized using grid search.
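The nested cross-validation scheme described above can be illustrated with a minimal sketch. It is not the NClassG+ implementation: the SVM is replaced by a toy threshold rule so that the example runs with the standard library alone, and the names `k_folds`, `nested_cv`, `fit_threshold` and `error_rate`, the fold counts and the toy data are all assumptions made for illustration. In practice `fit` would train an SVM on the sequence feature vectors and the grid would span kernels and their hyperparameters.

```python
import random


def k_folds(indices, k):
    """Split a list of indices into k roughly equal folds (round-robin)."""
    return [indices[i::k] for i in range(k)]


def nested_cv(X, y, param_grid, fit, error, outer_k=5, inner_k=3, seed=0):
    """Return the mean outer-loop error and the parameters chosen in each outer fold."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    outer_errors, chosen = [], []

    for test_idx in k_folds(idx, outer_k):
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]

        # Inner loop: grid search that only ever sees the outer-training data.
        best_params, best_err = None, float("inf")
        for params in param_grid:
            errs = []
            for val_idx in k_folds(train_idx, inner_k):
                val = set(val_idx)
                fit_idx = [i for i in train_idx if i not in val]
                model = fit([X[i] for i in fit_idx], [y[i] for i in fit_idx], params)
                errs.append(error(model, [X[i] for i in val_idx], [y[i] for i in val_idx]))
            mean_err = sum(errs) / len(errs)
            if mean_err < best_err:
                best_params, best_err = params, mean_err

        # Outer loop: refit with the selected parameters, then score on the
        # held-out fold, which played no part in the parameter selection.
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx], best_params)
        outer_errors.append(error(model, [X[i] for i in test_idx], [y[i] for i in test_idx]))
        chosen.append(best_params)

    return sum(outer_errors) / len(outer_errors), chosen


def fit_threshold(X_train, y_train, params):
    """Toy stand-in for SVM training: a fixed threshold rule (ignores the data)."""
    t = params["threshold"]
    return lambda x: 1 if x > t else 0


def error_rate(model, X_eval, y_eval):
    """Fraction of misclassified examples."""
    wrong = sum(1 for x, label in zip(X_eval, y_eval) if model(x) != label)
    return wrong / len(X_eval)


# Toy one-dimensional data (hypothetical): the label is 1 when x exceeds 0.45.
X = [i / 20 for i in range(20)]
y = [1 if x > 0.45 else 0 for x in X]
grid = [{"threshold": 0.2}, {"threshold": 0.475}, {"threshold": 0.8}]

mean_error, chosen = nested_cv(X, y, grid, fit_threshold, error_rate)
print(mean_error, chosen)
```

Keeping parameter selection inside the inner loop means the outer-fold error is computed on data that never influenced the choice of parameters, which is the reason for nesting the two loops.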
The final model was tested on an independent set that it had not seen during training and achieved better predictive performance than SecretomeP V2.0 and SecretP V2.0 in identifying non-classically secreted proteins. NClassG+ is freely available on the web at http://www.biolisi.unal.edu.co/web-servers/nclassgpositive/.