Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, Hong Kong.
Proteome Sci. 2011 Oct 14;9 Suppl 1(Suppl 1):S8. doi: 10.1186/1477-5956-9-S1-S8.
The functions of proteins are closely related to their subcellular locations. In the post-genomics era, the amount of gene and protein data grows exponentially, which necessitates the prediction of subcellular localization by computational means.
This paper proposes reducing the computational burden of alignment-based approaches to subcellular localization prediction through a cascaded fusion of cleavage site prediction and profile alignment. Specifically, the informative segments of protein sequences are identified by a cleavage site predictor using the information in their N-terminal sorting signals. The sequences are then truncated at the cleavage site positions, and the shortened sequences are passed to PSI-BLAST to compute their profiles. Subcellular locations are subsequently predicted by a profile-to-profile alignment support vector machine (SVM) classifier. To further reduce the training and recognition time of the classifier, the SVM classifier is replaced by a new kernel method based on perturbational discriminant analysis (PDA).
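The truncation step can be illustrated with a minimal sketch. It assumes the cleavage site positions have already been produced by some external predictor, and it assumes that the retained "informative segment" is the N-terminal portion up to the cleavage site; the abstract does not spell out either detail. The function names, the example sequences, and the site positions are all hypothetical. The resulting FASTA text is the kind of input that would then be passed to PSI-BLAST for profile creation.

```python
def truncate_at_cleavage(sequence: str, cleavage_pos: int) -> str:
    """Return the N-terminal segment sequence[:cleavage_pos].

    Assumption: cleavage_pos counts the residues kept from the N-terminus.
    """
    if not 0 < cleavage_pos <= len(sequence):
        raise ValueError("cleavage position lies outside the sequence")
    return sequence[:cleavage_pos]


def to_fasta(records: dict[str, str], width: int = 60) -> str:
    """Format name -> sequence pairs as FASTA text, wrapped at `width`."""
    lines = []
    for name, seq in records.items():
        lines.append(f">{name}")
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


# Hypothetical sequences and predicted cleavage sites (illustration only).
sequences = {
    "protA": "MKTAYIAKQRQISFVKSHFSRQ",
    "protB": "MALWMRLLPLLALLALWGPD",
}
predicted_sites = {"protA": 8, "protB": 12}

truncated = {
    name: truncate_at_cleavage(seq, predicted_sites[name])
    for name, seq in sequences.items()
}
fasta_text = to_fasta(truncated)
```

Because the shortened sequences are much shorter than the originals, both profile creation and the subsequent pairwise profile alignments operate on less data, which is the source of the time savings the abstract describes.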
Experimental results on a new dataset based on Swiss-Prot Release 57.5 show that the method combines the strengths of signal-based and homology-based approaches and attains an accuracy comparable to that achieved with full-length sequences. Analysis of the profile-alignment score matrices suggests that both profile creation time and profile alignment time can be reduced without a significant reduction in subcellular localization accuracy. PDA was also found to require a shorter training time than the conventional SVM. We believe the method will be important for biologists conducting large-scale protein annotation and for bioinformaticians performing preliminary investigations of new algorithms that involve pairwise alignments.