PSLDoc: Protein subcellular localization prediction based on gapped-dipeptides and probabilistic latent semantic analysis.

Authors

Chang Jia-Ming, Su Emily Chia-Yu, Lo Allan, Chiu Hua-Sheng, Sung Ting-Yi, Hsu Wen-Lian

Affiliation

Bioinformatics Lab, Institute of Information Science, Academia Sinica, Taipei, Taiwan.

Publication

Proteins. 2008 Aug;72(2):693-710. doi: 10.1002/prot.21944.

DOI:10.1002/prot.21944

PMID:18260102

Abstract

Prediction of protein subcellular localization (PSL) is important for genome annotation, protein function prediction, and drug discovery. Many computational approaches for PSL prediction based on protein sequences have been proposed in recent years for Gram-negative bacteria. We present PSLDoc, a method based on gapped-dipeptides and probabilistic latent semantic analysis (PLSA) to solve this problem. A protein is considered as a term string composed of gapped-dipeptides, which are defined as any two residues separated by one or more positions. The weight of each gapped-dipeptide is calculated from a position-specific scoring matrix, which incorporates sequence evolutionary information. PLSA is then applied for feature reduction, and the reduced vectors are input to five one-versus-rest support vector machine classifiers. The localization site with the highest probability is assigned as the final prediction. A strong correlation between sequence homology and subcellular localization has been reported (Nair and Rost, Protein Sci 2002;11:2836-2847; Yu et al., Proteins 2006;64:643-651). To properly evaluate the performance of PSLDoc, a target protein is therefore assigned to either the low- or the high-homology data set. PSLDoc's overall accuracy on the low- and high-homology data sets reaches 86.84% and 98.21%, respectively, which compares favorably with that of CELLO II (Yu et al., Proteins 2006;64:643-651). In addition, we set a confidence threshold to achieve high precision at specified levels of recall. When the confidence threshold is set at 0.7, PSLDoc achieves a precision of 97.89%, which is considerably better than that of PSORTb v.2.0 (Gardy et al., Bioinformatics 2005;21:617-623). Our approach demonstrates that this feature representation for proteins can be successfully applied to the prediction of protein subcellular localization and improves prediction accuracy. Moreover, because of the generality of the representation, our method can be extended to eukaryotic proteomes in the future. The PSLDoc web server is publicly available at http://bio-cluster.iis.sinica.edu.tw/~bioapp/PSLDoc/.
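
The two ends of the pipeline described in the abstract, turning a sequence into gapped-dipeptide terms and applying a confidence threshold to the final prediction, can be illustrated with a minimal sketch. This is not the PSLDoc implementation. The function names `gapped_dipeptides` and `predict_with_threshold`, the `max_distance` parameter, and the reading of "separated by d positions" as an index offset of d are all assumptions made for illustration. Raw counts stand in for the PSSM-derived weights, and the PLSA reduction and SVM classifiers are omitted entirely.

```python
from collections import Counter


def gapped_dipeptides(sequence, max_distance):
    """Count terms of the form 'XdY': residue X, then residue Y d positions later.

    Assumption: "separated by d positions" is read as an index offset of d
    (d = 1 means adjacent residues). Raw counts are used here in place of
    the PSSM-based weights described in the abstract.
    """
    counts = Counter()
    for d in range(1, max_distance + 1):
        for i in range(len(sequence) - d):
            counts[f"{sequence[i]}{d}{sequence[i + d]}"] += 1
    return counts


def predict_with_threshold(site_probabilities, threshold=0.7):
    """Return the most probable site, or None if it falls below the threshold."""
    site, probability = max(site_probabilities.items(), key=lambda item: item[1])
    return site if probability >= threshold else None


if __name__ == "__main__":
    # Hypothetical toy inputs, not data from the paper.
    print(gapped_dipeptides("MKKLA", max_distance=2))
    print(predict_with_threshold({"cytoplasmic": 0.82, "periplasmic": 0.18}))
```

Returning `None` below the threshold mirrors the trade-off the abstract describes: withholding low-confidence predictions lowers recall in exchange for higher precision on the predictions that are made.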

Similar articles

Prediction of protein subcellular localization.

Proteins. 2006 Aug 15;64(3):643-51. doi: 10.1002/prot.21018.

SubCellProt: predicting protein subcellular localization using machine learning approaches.

In Silico Biol. 2009;9(1-2):35-44.

Enhanced membrane protein topology prediction using a hierarchical classification method and a new scoring function.

J Proteome Res. 2008 Feb;7(2):487-96. doi: 10.1021/pr0702058.

Prediction of nuclear proteins using nuclear translocation signals proposed by probabilistic latent semantic indexing.

BMC Bioinformatics. 2012;13 Suppl 17(Suppl 17):S13. doi: 10.1186/1471-2105-13-S17-S13. Epub 2012 Dec 13.

Prediction of subcellular localization of eukaryotic proteins using position-specific profiles and neural network with weighted inputs.

J Genet Genomics. 2007 Dec;34(12):1080-7. doi: 10.1016/S1673-8527(07)60123-4.

LOCSVMPSI: a web server for subcellular localization of eukaryotic proteins using SVM and profile of PSI-BLAST.

Nucleic Acids Res. 2005 Jul 1;33(Web Server issue):W105-10. doi: 10.1093/nar/gki359.

Protein subcellular localization prediction using artificial intelligence technology.

Methods Mol Biol. 2008;484:435-63. doi: 10.1007/978-1-59745-398-1_27.

Large-scale predictions of gram-negative bacterial protein subcellular locations.

J Proteome Res. 2006 Dec;5(12):3420-8. doi: 10.1021/pr060404b.

pTARGET [corrected] a new method for predicting protein subcellular localization in eukaryotes.

Bioinformatics. 2005 Nov 1;21(21):3963-9. doi: 10.1093/bioinformatics/bti650. Epub 2005 Sep 6.

Cited by

PRIP: A Protein-RNA Interface Predictor Based on Semantics of Sequences.

Life (Basel). 2022 Feb 18;12(2):307. doi: 10.3390/life12020307.

GODoc: high-throughput protein function prediction using novel k-nearest-neighbor and voting algorithms.

BMC Bioinformatics. 2020 Nov 18;21(Suppl 6):276. doi: 10.1186/s12859-020-03556-9.

Structural and Computational Biology in the Design of Immunogenic Vaccine Antigens.

J Immunol Res. 2015;2015:156241. doi: 10.1155/2015/156241. Epub 2015 Oct 7.

Efficient and interpretable prediction of protein functional classes by correspondence analysis and compact set relations.

PLoS One. 2013 Oct 11;8(10):e75542. doi: 10.1371/journal.pone.0075542. eCollection 2013.

An ensemble method for predicting subnuclear localizations from primary protein structures.

PLoS One. 2013;8(2):e57225. doi: 10.1371/journal.pone.0057225. Epub 2013 Feb 27.

EuLoc: a web-server for accurately predict protein subcellular localization in eukaryotes by incorporating various features of sequence segments into the general form of Chou's PseAAC.

J Comput Aided Mol Des. 2013 Jan;27(1):91-103. doi: 10.1007/s10822-012-9628-0. Epub 2013 Jan 3.

Prediction of nuclear proteins using nuclear translocation signals proposed by probabilistic latent semantic indexing.

BMC Bioinformatics. 2012;13 Suppl 17(Suppl 17):S13. doi: 10.1186/1471-2105-13-S17-S13. Epub 2012 Dec 13.

Predicted protein subcellular localization in dominant surface ocean bacterioplankton.

Appl Environ Microbiol. 2012 Sep;78(18):6550-7. doi: 10.1128/AEM.01406-12. Epub 2012 Jul 6.

FGsub: Fusarium graminearum protein subcellular localizations predicted from primary structures.

BMC Syst Biol. 2010 Sep 13;4 Suppl 2(Suppl 2):S12. doi: 10.1186/1752-0509-4-S2-S12.

PSORTb 3.0: improved protein subcellular localization prediction with refined localization subcategories and predictive capabilities for all prokaryotes.

Bioinformatics. 2010 Jul 1;26(13):1608-15. doi: 10.1093/bioinformatics/btq249. Epub 2010 May 13.
