Kumar Manish, Raghava Gajendra P S
Bioinformatics Centre, Institute of Microbial Technology, Chandigarh, India.
BMC Bioinformatics. 2009 Jan 19;10:22. doi: 10.1186/1471-2105-10-22.
The nucleus, a highly organized organelle, plays an important role in cellular homeostasis. Nuclear proteins are crucial for chromosomal maintenance/segregation, gene expression, RNA processing/export, and many other processes. Several methods for predicting nuclear proteins have been developed in the past. The aim of the present study is to develop a new method that predicts nuclear proteins with higher accuracy.
All modules were trained and tested on a non-redundant dataset and evaluated using five-fold cross-validation. First, Support Vector Machine (SVM) based modules were developed using amino acid and dipeptide compositions; these achieved Matthews correlation coefficients (MCC) of 0.59 and 0.61, respectively. Second, SVM modules were developed using split amino acid composition (SAAC), achieving a maximum MCC of 0.66. Third, a hidden Markov model (HMM) based module/profile was developed for searching for exclusively nuclear and non-nuclear domains in a protein. Finally, a hybrid module was developed by combining the SVM module and the HMM profile; it achieved an MCC of 0.87 with an accuracy of 94.61%. This method performs better than existing methods when evaluated on blind/independent datasets. Our method estimated 31.51%, 21.89%, 26.31%, 25.72% and 24.95% of the proteins as nuclear in the Saccharomyces cerevisiae, Caenorhabditis elegans, Drosophila melanogaster, mouse and human proteomes, respectively. Based on the above modules, we have developed a web server, NpPred, for predicting nuclear proteins: http://www.imtech.res.in/raghava/nppred/.
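As a rough illustration of the composition features and the MCC metric mentioned above, the following minimal sketch computes a 20-dimensional amino acid composition, a 400-dimensional dipeptide composition, and the MCC from confusion-matrix counts. The function names, the handling of non-standard residues, and the zero-denominator convention are assumptions made for this example; they are not taken from the NpPred implementation.

```python
from itertools import product
from math import sqrt

# The 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Fixed ordering of the 400 possible dipeptides.
DIPEPTIDE_INDEX = {
    a + b: i for i, (a, b) in enumerate(product(AMINO_ACIDS, repeat=2))
}


def amino_acid_composition(seq):
    """Fraction of each standard residue in seq (20 values).

    Assumption: non-standard residues (e.g. X, B, Z) are ignored.
    """
    residues = [r for r in seq.upper() if r in AMINO_ACIDS]
    n = len(residues)
    return [residues.count(a) / n if n else 0.0 for a in AMINO_ACIDS]


def dipeptide_composition(seq):
    """Fraction of each adjacent residue pair in seq (400 values).

    Assumption: pairs containing a non-standard residue are skipped.
    """
    seq = seq.upper()
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    pairs = [p for p in pairs if p in DIPEPTIDE_INDEX]
    vector = [0.0] * len(DIPEPTIDE_INDEX)
    for p in pairs:
        vector[DIPEPTIDE_INDEX[p]] += 1.0 / len(pairs)
    return vector


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    Returns 0.0 when the denominator is zero (a common convention,
    assumed here).
    """
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

Feature vectors of this kind could then be passed to any SVM implementation; the kernel and parameters used by the authors are not stated in this abstract.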
This study describes a highly accurate method for predicting nuclear proteins. For the first time, an SVM module for predicting nuclear proteins has been developed using SAAC, in which the amino acid compositions of the N-terminus and of the remaining protein are computed separately. In addition, this study is the first to identify exclusively nuclear and non-nuclear domains and to use them for predicting nuclear proteins. Combining the two approaches improved the performance of the method further.
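The split amino acid composition described above can be sketched as follows: the sequence is cut into an N-terminal segment and the remainder, and a separate 20-dimensional composition is computed for each, giving a 40-dimensional vector. The length of the N-terminal segment is not given in this abstract, so the `n_term_len` parameter and its default of 25 residues are purely hypothetical, as are the function names.

```python
# The 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(segment):
    """Fraction of each standard residue in a sequence segment (20 values).

    Assumption: non-standard residues are ignored; an empty segment
    yields an all-zero vector.
    """
    residues = [r for r in segment.upper() if r in AMINO_ACIDS]
    n = len(residues)
    return [residues.count(a) / n if n else 0.0 for a in AMINO_ACIDS]


def split_composition(seq, n_term_len=25):
    """Concatenate the N-terminal composition with that of the remainder.

    n_term_len is a hypothetical default, not a value from the paper.
    """
    n_terminal = seq[:n_term_len]
    remainder = seq[n_term_len:]
    return composition(n_terminal) + composition(remainder)
```

For example, `split_composition("AAAACCCC", n_term_len=4)` yields a 40-value vector whose first half describes the segment `AAAA` and whose second half describes `CCCC`, so a residue bias confined to the N-terminus stays visible instead of being averaged over the whole sequence.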