He Jianjun, Gu Hong, Liu Wenqi
School of Control Science and Engineering, Dalian University of Technology, Dalian, Liaoning, China.
PLoS One. 2012;7(6):e37155. doi: 10.1371/journal.pone.0037155. Epub 2012 Jun 8.
Determining the subcellular location of a protein is an important step toward understanding its function. Although numerous prediction algorithms have been developed, most of them focus on proteins with only one location. In recent years, researchers have begun to address subcellular localization prediction for proteins with multiple sites. However, almost all existing approaches fail to take into account the correlations among locations that arise from multi-site proteins, which may be important information for improving prediction accuracy on such proteins. In this paper, a new algorithm that effectively exploits the correlations among locations is proposed, based on a Gaussian process model. In addition, the algorithm can realize an optimal linear combination of various feature extraction techniques and can be robust to imbalanced data sets. Experimental results on a human protein data set show that the proposed algorithm is valid and achieves better performance than existing approaches.
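The abstract names three ingredients: a Gaussian process model, correlations among locations, and a linear combination of feature extraction techniques. The sketch below illustrates how these ideas can fit together. It is not the authors' algorithm. It assumes a separable (Kronecker) multi-output Gaussian process, fixed rather than optimized combination weights, RBF kernels, and a label covariance built from label co-occurrence. All function names, parameters, and the toy data are hypothetical, and the handling of imbalanced data is not covered.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    """RBF kernel between the rows of A and the rows of B."""
    sq = (np.sum(A**2, axis=1)[:, None]
          + np.sum(B**2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return np.exp(-0.5 * np.maximum(sq, 0.0) / length_scale**2)

def combined_kernel(views_a, views_b, weights):
    """Weighted sum of one kernel per feature view (weights are fixed here,
    an assumption; the paper describes an optimal combination)."""
    return sum(w * rbf_kernel(a, b)
               for w, a, b in zip(weights, views_a, views_b))

def gp_multilabel_scores(train_views, Y, test_views, weights, noise=0.1):
    """Posterior mean of a separable multi-output GP.

    Y is an (n, L) binary matrix: Y[i, l] = 1 if protein i is at location l.
    The label covariance B is built from co-occurrence counts (an assumption),
    so locations that often appear together share information.
    """
    n, L = Y.shape
    K = combined_kernel(train_views, train_views, weights)    # (n, n)
    K_s = combined_kernel(test_views, train_views, weights)   # (m, n)
    B = Y.T @ Y / n + 1e-2 * np.eye(L)                        # (L, L), positive definite
    y_vec = Y.T.reshape(-1)                                   # label-major stacking
    alpha = np.linalg.solve(np.kron(B, K) + noise * np.eye(n * L), y_vec)
    mean = np.kron(B, K_s) @ alpha                            # length L * m
    return mean.reshape(L, K_s.shape[0]).T                    # (m, L) scores

# Toy data: 6 proteins, two hypothetical feature views, 3 locations.
rng = np.random.default_rng(0)
train_views = [rng.normal(size=(6, 5)), rng.normal(size=(6, 3))]
test_views = [rng.normal(size=(2, 5)), rng.normal(size=(2, 3))]
Y = np.array([[1, 0, 0],
              [1, 1, 0],
              [0, 1, 0],
              [1, 1, 0],
              [0, 0, 1],
              [0, 1, 1]], dtype=float)

scores = gp_multilabel_scores(train_views, Y, test_views, weights=(0.7, 0.3))
predicted = scores > 0.5   # illustrative threshold, one decision per location
```

Each row of `scores` holds one score per location, so a protein can be assigned to several locations at once. Replacing `B` with an identity matrix would decouple the locations into independent Gaussian processes, which is the setting the abstract argues against.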