School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, 210094, China.
IEEE Trans Nanobioscience. 2012 Dec;11(4):375-85. doi: 10.1109/TNB.2012.2208473. Epub 2012 Aug 3.
Membrane proteins are encoded by roughly 30% of the genes in a genome and play important roles in living organisms. Previous studies have shown that the structures and functions of membrane proteins have clear organelle-specific properties. Because wet-lab studies of membrane proteins are extremely difficult, predicting a membrane protein's subcellular location from its primary sequence is highly desirable. Although many models have been developed for predicting protein subcellular locations, only a few are specific to membrane proteins. Existing prediction approaches are built on statistical machine learning algorithms with a serial combination of multi-view features, i.e., different feature vectors are simply concatenated to form a super feature vector. However, such simple concatenation also increases information redundancy, which can in turn degrade the final prediction accuracy. This is why prediction success rates in the serial super space have often been found to be even lower than those in a single-view space. The purpose of this paper is to investigate a proper method for fusing multi-view protein sequence features for subcellular location prediction. Instead of the serial strategy, we propose a novel parallel framework for fusing multiple multi-view attributes of membrane proteins, which represents protein samples in complex spaces. We also propose generalized principal component analysis (GPCA) for feature reduction in the complex geometry. Experimental results with different machine learning algorithms on benchmark membrane protein subcellular localization datasets demonstrate that the newly proposed parallel strategy outperforms the traditional serial approach. We also demonstrate the efficacy of the parallel strategy on a soluble protein subcellular localization dataset, indicating that the parallel technique is flexible enough to suit other computational biology problems.
The software and datasets are available at: http://www.csbio.sjtu.edu.cn/bioinf/mpsp.
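To make the contrast between serial and parallel fusion concrete, here is a minimal sketch in Python with NumPy. It rests on several assumptions that the abstract does not confirm: that "complex spaces" means complex-valued vector spaces in which one feature view is the real part and another is the imaginary part, that a shorter view is zero-padded to match the longer one, and that the reduction step can be approximated by ordinary PCA on the Hermitian covariance matrix. The function names `serial_fuse`, `parallel_fuse`, and `complex_pca` are hypothetical, and this is not the authors' GPCA implementation.

```python
import numpy as np


def serial_fuse(view_a, view_b):
    """Serial strategy: concatenate two views into one longer real vector per sample."""
    return np.hstack([view_a, view_b])


def parallel_fuse(view_a, view_b):
    """Parallel strategy (assumed form): z = a + i*b, zero-padding the shorter view."""
    dim = max(view_a.shape[1], view_b.shape[1])
    a = np.pad(view_a, ((0, 0), (0, dim - view_a.shape[1])), mode="constant")
    b = np.pad(view_b, ((0, 0), (0, dim - view_b.shape[1])), mode="constant")
    return a + 1j * b


def complex_pca(z, n_components):
    """PCA on complex samples (rows of z) via the Hermitian covariance matrix.

    This is a stand-in for the paper's GPCA, not a reproduction of it.
    """
    centered = z - z.mean(axis=0)
    # Hermitian covariance: eigenvalues are real and non-negative.
    cov = centered.conj().T @ centered / (z.shape[0] - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    return centered @ eigvecs[:, order], eigvals[order]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    view_a = rng.normal(size=(6, 4))  # hypothetical first feature view
    view_b = rng.normal(size=(6, 3))  # hypothetical second feature view
    print(serial_fuse(view_a, view_b).shape)
    fused = parallel_fuse(view_a, view_b)
    reduced, variances = complex_pca(fused, 2)
    print(fused.shape, reduced.shape)
```

Under these assumptions, the serial vector grows with the sum of the view dimensions, while the parallel vector is only as long as the larger view. That difference is one way to read the abstract's point about redundancy in the serial super space.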