Nanjing University of Science and Technology, China.
School of Computer Science and Engineering, Nanjing University of Science and Technology, China.
Brief Bioinform. 2021 Nov 5;22(6). doi: 10.1093/bib/bbab278.
Protein subcellular localization plays a crucial role in characterizing protein function and understanding various cellular processes. Accurate identification of protein subcellular location is therefore an important yet challenging task. Numerous computational methods have been proposed to predict the subcellular location of proteins, but most existing methods are limited in overall accuracy, time consumption and generalization ability. To address these problems, we developed a novel computational approach based on Human Protein Atlas (HPA) data, referred to as PScL-HDeep, for accurate and efficient image-based prediction of protein subcellular location in human tissues. We extracted handcrafted and deep-learned features (the latter obtained with a pretrained deep learning model) from different viewpoints of the image. The stepwise discriminant analysis (SDA) algorithm was applied to each raw feature set to generate an optimal feature subset. To obtain a still more informative feature subset, the support vector machine-based recursive feature elimination with correlation bias reduction (SVM-RFE + CBR) feature selection algorithm was then applied to the integrated feature set. Finally, two classification models, a support vector machine with a radial basis function kernel (SVM-RBF) and a support vector machine with a linear kernel (SVM-LNR), were trained on the final selected feature set. To evaluate the performance of the proposed method, a new gold-standard benchmark training dataset was constructed from the HPA databank. PScL-HDeep achieved the best performance in a 10-fold cross-validation test on this dataset and outperformed existing predictors. Furthermore, we illustrated the generalization ability of the proposed method by conducting a stringent independent validation test.
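As a rough illustration of the 10-fold cross-validation protocol mentioned in the abstract, the following minimal sketch splits samples into stratified folds and reports the accuracy on each held-out fold. It is not the authors' implementation. The function names, the synthetic feature vectors and the nearest-centroid classifier are all assumptions made for the example; the nearest-centroid rule is only a lightweight stand-in for the SVM-RBF and SVM-LNR classifiers, and no feature extraction or feature selection is performed.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Split sample indices into k folds with similar class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class's samples round-robin across the folds.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


def fit_centroids(X, y):
    """Stand-in classifier (NOT an SVM): per-class mean feature vector."""
    sums, counts = {}, defaultdict(int)
    for row, label in zip(X, y):
        if label not in sums:
            sums[label] = [0.0] * len(row)
        sums[label] = [s + v for s, v in zip(sums[label], row)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total]
            for label, total in sums.items()}


def predict(centroids, row):
    """Assign the class whose centroid is nearest (squared Euclidean distance)."""
    def sq_dist(center):
        return sum((a - b) ** 2 for a, b in zip(row, center))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def cross_validate(X, y, k=10, seed=0):
    """Return the accuracy measured on each held-out fold."""
    accuracies = []
    for fold in stratified_folds(y, k, seed):
        held_out = set(fold)
        train_X = [X[i] for i in range(len(X)) if i not in held_out]
        train_y = [y[i] for i in range(len(y)) if i not in held_out]
        centroids = fit_centroids(train_X, train_y)
        correct = sum(predict(centroids, X[i]) == y[i] for i in fold)
        accuracies.append(correct / len(fold))
    return accuracies


# Synthetic stand-in for selected image features (hypothetical data):
# 3 classes, 40 samples per class, 5 features per sample.
rng = random.Random(42)
X, y = [], []
for label in range(3):
    for _ in range(40):
        X.append([rng.gauss(3.0 * label, 1.0) for _ in range(5)])
        y.append(label)

fold_accuracies = cross_validate(X, y, k=10)
mean_accuracy = sum(fold_accuracies) / len(fold_accuracies)
print(f"mean 10-fold accuracy: {mean_accuracy:.3f}")
```

Stratifying the folds keeps every subcellular-location class represented in each held-out fold, which matters when some classes have few images. In a real pipeline, any feature selection would need to be fitted inside each training fold so that the held-out fold does not influence the selected features.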