Graduate Institute of Biomedical Informatics, Taipei Medical University, Taipei, Taiwan.
BMC Bioinformatics. 2012;13 Suppl 17(Suppl 17):S13. doi: 10.1186/1471-2105-13-S17-S13. Epub 2012 Dec 13.
Identifying the subcellular localization of proteins is crucial for elucidating cellular processes and molecular functions. However, given the tremendous amount of sequence data generated in the post-genomic era, determining protein localization through biological experiments can be expensive and time-consuming. Developing prediction systems that analyze uncharacterised proteins efficiently has therefore played an important role in high-throughput protein analyses. In a eukaryotic cell, many essential biological processes take place in the nucleus. Nuclear proteins shuttle between the nucleus and the cytoplasm based on recognition of nuclear translocation signals, including nuclear localization signals (NLSs) and nuclear export signals (NESs). Currently, only a few approaches have been developed specifically to predict nuclear localization using sequence features, such as putative NLSs. However, it has been shown that prediction coverage based on NLSs is very low. In addition, most existing approaches attain prediction accuracy and Matthews correlation coefficient (MCC) of only around 54% to 70% and 0.250 to 0.380, respectively, on an independent test set. Moreover, no predictor can generate sequence motifs to characterize features of potential NESs, whose biological properties are not well understood from existing experimental studies.
In this study, we first propose PSLNuc (Protein Subcellular Localization prediction for Nucleus) for predicting nuclear localization in proteins. For feature representation, a protein is represented by gapped-dipeptides, and the feature values are weighted by homology information from a smoothed position-specific scoring matrix. We then apply probabilistic latent semantic indexing (PLSI) for feature reduction. Finally, the reduced features are used as input for a support vector machine (SVM) classifier. In addition to PSLNuc, we identify gapped-dipeptide signatures for putative NLSs and NESs to develop a second prediction method, PSLNTS (Protein Subcellular Localization prediction using Nuclear Translocation Signals). We apply PLSI to generate gapped-dipeptide signatures from both nuclear and non-nuclear proteins, and propose candidate sequence motifs for putative NLSs and NESs. In PSLNTS, we then use only the proposed gapped-dipeptide signatures as input to an SVM classifier, mimicking the biological properties of NLSs and NESs to predict nuclear localization.
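The gapped-dipeptide representation described above can be illustrated with a minimal Python sketch. It makes several assumptions that are not taken from the paper: the key format `XdZ` (residue X, then d intervening positions, then residue Z), the `max_gap` parameter, and the function names are all hypothetical. It also counts raw occurrences instead of weighting by a smoothed position-specific scoring matrix, and it omits the PLSI and SVM steps entirely.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def gapped_dipeptide_counts(sequence, max_gap=2):
    """Count gapped dipeptides keyed as 'XdZ' (hypothetical notation):
    residue X and residue Z separated by d intervening positions."""
    counts = Counter()
    seq = sequence.upper()
    for gap in range(max_gap + 1):
        for i in range(len(seq) - gap - 1):
            a, b = seq[i], seq[i + gap + 1]
            # Skip non-standard residue codes such as 'X' or 'B'.
            if a in AMINO_ACIDS and b in AMINO_ACIDS:
                counts[f"{a}{gap}{b}"] += 1
    return counts


def feature_vector(sequence, max_gap=2):
    """Fixed-length vector of relative frequencies, one entry per
    (gap, residue pair). Simplification: plain counts, no PSSM weighting."""
    counts = gapped_dipeptide_counts(sequence, max_gap)
    total = sum(counts.values()) or 1  # avoid division by zero
    keys = [
        f"{a}{gap}{b}"
        for gap in range(max_gap + 1)
        for a, b in product(AMINO_ACIDS, repeat=2)
    ]
    return [counts[k] / total for k in keys]
```

With `max_gap=2` each protein maps to a vector of 3 × 400 = 1200 entries. A vector of this kind grows quickly with the gap range, which is consistent with the abstract's use of a feature-reduction step before classification.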
Experimental results demonstrate that the proposed method significantly improves nuclear localization prediction. To compare our predictive performance with other approaches, we use two non-redundant benchmark data sets, a training set and an independent test set. Evaluated by five-fold cross-validation on the training set, PSLNuc attains an overall accuracy of 79.7%, a 4.8% improvement over the state-of-the-art system. Our method also raises the MCC from 0.497 to 0.595. On the independent test set, PSLNuc outperforms other predictors by 3.9% to 19.9% in accuracy and by 0.077 to 0.207 in MCC. This suggests that, in addition to NLSs, which have been shown to be important for nuclear proteins, NESs can also be an effective indicator for detecting non-nuclear proteins. Most notably, using only a few proposed gapped-dipeptide signatures as input features for the SVM classifier, PSLNTS further enhances the accuracy and MCC to 80.9% and 0.618, respectively. Our results demonstrate that gapped-dipeptide signatures can better discriminate nuclear and non-nuclear proteins. Moreover, the proposed gapped-dipeptide signatures can be biologically interpreted and used in further experimental analyses of nuclear translocation signals, including NLSs and NESs.
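For readers unfamiliar with the two reported metrics, the following short sketch shows how accuracy and MCC are computed from a binary confusion matrix using their standard definitions. The function name and the counts in the usage line are illustrative only and are not taken from the paper's data.

```python
import math


def accuracy_and_mcc(tp, tn, fp, fn):
    """Return (accuracy, MCC) for a binary confusion matrix.

    tp/tn/fp/fn are the counts of true positives, true negatives,
    false positives, and false negatives.
    """
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: MCC is taken as 0 when any marginal sum is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return accuracy, mcc


# Made-up counts for illustration (not results from the study).
acc, mcc = accuracy_and_mcc(tp=40, tn=40, fp=10, fn=10)
```

Unlike accuracy, MCC ranges from -1 to 1 and stays near 0 for a predictor that guesses the majority class, which is why the two metrics are reported together.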