Prediction of nuclear proteins using nuclear translocation signals proposed by probabilistic latent semantic indexing.

Graduate Institute of Biomedical Informatics, Taipei Medical University, Taipei, Taiwan.

BMC Bioinformatics. 2012;13 Suppl 17(Suppl 17):S13. doi: 10.1186/1471-2105-13-S17-S13. Epub 2012 Dec 13.

BACKGROUND

Identifying the subcellular localization of proteins is crucial for elucidating cellular processes and molecular functions. However, given the tremendous amount of sequence data generated in the post-genomic era, determining protein localization through biological experiments can be expensive and time-consuming. Prediction systems that analyze uncharacterised proteins efficiently therefore play an important role in high-throughput protein analyses. In a eukaryotic cell, many essential biological processes take place in the nucleus. Nuclear proteins shuttle between the nucleus and the cytoplasm through recognition of nuclear translocation signals, including nuclear localization signals (NLSs) and nuclear export signals (NESs). Currently, only a few approaches have been developed specifically to predict nuclear localization using sequence features such as putative NLSs, and prediction coverage based on NLSs has been shown to be very low. In addition, most existing approaches attain a prediction accuracy of only about 54% to 70% and a Matthews correlation coefficient (MCC) of about 0.250 to 0.380 on an independent test set. Moreover, no predictor can generate sequence motifs that characterize potential NESs, whose biological properties are not well understood from existing experimental studies.

RESULTS

In this study, we first propose PSLNuc (Protein Subcellular Localization prediction for Nucleus) for predicting the nuclear localization of proteins. For feature representation, a protein is represented by gapped-dipeptides, and the feature values are weighted by homology information from a smoothed position-specific scoring matrix. We then apply probabilistic latent semantic indexing (PLSI) for feature reduction. Finally, the reduced features are used as input to a support vector machine (SVM) classifier. In addition to PSLNuc, we identify gapped-dipeptide signatures for putative NLSs and NESs to develop a second prediction method, PSLNTS (Protein Subcellular Localization prediction using Nuclear Translocation Signals). We apply PLSI to generate gapped-dipeptide signatures from both nuclear and non-nuclear proteins, and propose candidate sequence motifs for putative NLSs and NESs. In PSLNTS, only the proposed gapped-dipeptide signatures are used as features in an SVM classifier, so that the prediction of nuclear localization mimics the biological properties of NLSs and NESs.
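As a rough illustration of the gapped-dipeptide representation mentioned above, the following Python sketch counts residue pairs separated by a fixed number of intervening positions and normalizes them into a feature vector. It is a simplified sketch under stated assumptions, not the PSLNuc implementation: the function names, the `max_gap` parameter, and the use of raw relative frequencies are all hypothetical choices made here, whereas the study weights features with a smoothed position-specific scoring matrix and then applies PLSI.

```python
from collections import Counter

# The 20 standard amino acids; other symbols (e.g. X) are skipped.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def gapped_dipeptide_counts(sequence, max_gap=2):
    """Count gapped dipeptides (first, gap, second).

    `gap` is the number of residues between the two positions,
    so gap=0 is an ordinary adjacent dipeptide.
    `max_gap` is a hypothetical parameter chosen for illustration.
    """
    counts = Counter()
    n = len(sequence)
    for gap in range(max_gap + 1):
        for i in range(n - gap - 1):
            first, second = sequence[i], sequence[i + gap + 1]
            if first in AMINO_ACIDS and second in AMINO_ACIDS:
                counts[(first, gap, second)] += 1
    return counts


def feature_vector(sequence, max_gap=2):
    """Return relative frequencies over all 400 * (max_gap + 1) gapped dipeptides.

    Assumption: plain frequency weighting stands in for the
    PSSM-based weighting described in the abstract.
    """
    counts = gapped_dipeptide_counts(sequence, max_gap)
    total = sum(counts.values())
    keys = [
        (a, gap, b)
        for gap in range(max_gap + 1)
        for a in AMINO_ACIDS
        for b in AMINO_ACIDS
    ]
    return [counts[k] / total if total else 0.0 for k in keys]
```

A vector of this kind could then be passed to a dimensionality-reduction step and a classifier; here it only shows how a sequence maps onto a fixed-length feature space regardless of its length.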

CONCLUSIONS

Experimental results demonstrate that the proposed method significantly improves nuclear localization prediction. To compare our predictive performance with other approaches, we use two non-redundant benchmark data sets: a training set and an independent test set. Evaluated by five-fold cross-validation on the training set, PSLNuc attains an overall accuracy of 79.7%, a 4.8% improvement over the state-of-the-art system, and raises the MCC from 0.497 to 0.595. On the independent test set, PSLNuc outperforms other predictors by 3.9% to 19.9% in accuracy and by 0.077 to 0.207 in MCC. This suggests that, in addition to NLSs, which have been shown to be important for nuclear proteins, NESs can also be an effective indicator for detecting non-nuclear proteins. Most notably, using only a few proposed gapped-dipeptide signatures as input features for the SVM classifier, PSLNTS further raises the accuracy and MCC to 80.9% and 0.618, respectively. Our results demonstrate that gapped-dipeptide signatures can better discriminate nuclear from non-nuclear proteins. Moreover, the proposed gapped-dipeptide signatures can be biologically interpreted and used in further experimental analyses of nuclear translocation signals, including NLSs and NESs.
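For readers unfamiliar with the two reported metrics, the sketch below computes accuracy and the Matthews correlation coefficient from a binary confusion matrix using their standard definitions. The function name and the example counts are hypothetical and are not taken from the study's data.

```python
import math


def accuracy_and_mcc(tp, tn, fp, fn):
    """Return (accuracy, MCC) for a binary confusion matrix.

    tp/tn/fp/fn are counts of true positives, true negatives,
    false positives, and false negatives.
    """
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    accuracy = (tp + tn) / total
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, MCC is reported as 0 when any marginal sum is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return accuracy, mcc


# Illustrative counts only (not from the paper).
acc, mcc = accuracy_and_mcc(tp=40, tn=40, fp=10, fn=10)
```

With these illustrative counts the accuracy is 0.8 and the MCC is 0.6. Unlike accuracy, MCC uses all four cells of the confusion matrix, so it remains informative when the nuclear and non-nuclear classes are unbalanced.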


