Efficient and interpretable prediction of protein functional classes by correspondence analysis and compact set relations.

Affiliations

Comparative Bioinformatics, Bioinformatics and Genomics, Centre for Genomic Regulation (CRG), Barcelona, Spain; Universitat Pompeu Fabra (UPF), Barcelona, Spain.

Publication information

PLoS One. 2013 Oct 11;8(10):e75542. doi: 10.1371/journal.pone.0075542. eCollection 2013.

DOI:10.1371/journal.pone.0075542

PMID:24146760

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3795737/

Abstract

Predicting protein functional classes such as localization sites and modifications plays a crucial role in function annotation. Given the tremendous amount of sequence data produced by high-throughput sequencing experiments, the need for efficient and interpretable prediction strategies has grown rapidly. Our previous approach for subcellular localization prediction, PSLDoc, achieves high overall accuracy for Gram-negative bacteria. However, PSLDoc is computationally intensive because it incorporates homology extension in feature extraction and probabilistic latent semantic analysis in feature reduction. In addition, predictions generated by support vector machines are accurate but generally difficult to interpret. In this work, we incorporate three new techniques to improve efficiency and interpretability. First, homology extension is performed against a compact non-redundant database using a fast search model to reduce running time. Second, correspondence analysis (CA) is incorporated as an efficient feature reduction method that generates a clear visual separation of different protein classes. Finally, functional classes are predicted by combining an accurate compact set (CS) relation with an interpretable one-nearest neighbor (1-NN) algorithm. Besides localization data sets, we also apply the method to a human protein kinase set to validate its generality. Experimental results demonstrate that our method makes accurate predictions in a more efficient and interpretable manner. First, homology extension using a fast search on a compact database runs up to twenty-five times faster than the traditional approach without sacrificing prediction performance. This suggests that the computational costs of many other predictors that also incorporate homology information can be largely reduced. In addition, CA not only efficiently identifies discriminative features but also provides a clear visualization of different functional classes. Moreover, predictions based on CS achieve 100% precision. When CS is combined with 1-NN for the targets that CS leaves unpredicted, our method attains slightly better or comparable performance compared with state-of-the-art systems.
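To make the CA and 1-NN steps described in the abstract more concrete, the following is a minimal sketch of textbook simple correspondence analysis computed with an SVD, followed by a nearest-neighbor lookup in the reduced space. It is not the authors' implementation: the function names, the toy count matrix, and the class labels are hypothetical, the compact set step is not covered, and the sketch assumes every row and column of the feature matrix has a positive sum.

```python
import numpy as np

def correspondence_analysis(counts, n_axes=2):
    """Simple CA of a nonnegative (proteins x features) matrix via SVD.

    Assumes every row and column sum is positive.
    """
    N = np.asarray(counts, dtype=float)
    P = N / N.sum()                       # correspondence matrix
    r = P.sum(axis=1)                     # row masses
    c = P.sum(axis=0)                     # column masses
    # Standardized residuals: D_r^{-1/2} (P - r c^T) D_c^{-1/2}
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    U, sv, Vt = U[:, :n_axes], sv[:n_axes], Vt[:n_axes]
    row_coords = (U / np.sqrt(r)[:, None]) * sv   # principal coordinates of rows
    col_std = Vt.T / np.sqrt(c)[:, None]          # standard coordinates of columns
    return row_coords, col_std, sv

def project_rows(counts, col_std):
    """Place (possibly unseen) rows in CA space: row profile times column standard coordinates."""
    X = np.asarray(counts, dtype=float)
    profiles = X / X.sum(axis=1, keepdims=True)
    return profiles @ col_std

def nearest_neighbor_label(query, train_coords, train_labels):
    """1-NN: return the label of the closest training point (Euclidean distance)."""
    distances = np.linalg.norm(train_coords - query, axis=1)
    return train_labels[int(np.argmin(distances))]

# Hypothetical toy data: 4 training proteins x 4 features, two classes.
train = np.array([[8, 1, 1, 0],
                  [7, 2, 0, 1],
                  [1, 0, 9, 6],
                  [0, 1, 7, 8]])
labels = ["A", "A", "B", "B"]

row_coords, col_std, sv = correspondence_analysis(train, n_axes=2)
query_coords = project_rows(np.array([[6, 1, 1, 1]]), col_std)
predicted = nearest_neighbor_label(query_coords[0], row_coords, labels)
print(predicted)
```

Each prediction here can be traced back to a single training protein and to a position in a low-dimensional plot, which is the sense in which a 1-NN rule in CA space is easier to interpret than a support vector machine decision function.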

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3cc8/3795737/fe4f3b59c0f5/pone.0075542.g001.jpg

Similar articles

Prediction of nuclear proteins using nuclear translocation signals proposed by probabilistic latent semantic indexing.

BMC Bioinformatics. 2012;13 Suppl 17(Suppl 17):S13. doi: 10.1186/1471-2105-13-S17-S13. Epub 2012 Dec 13.

Protein subcellular localization prediction based on compartment-specific features and structure conservation.

BMC Bioinformatics. 2007 Sep 8;8:330. doi: 10.1186/1471-2105-8-330.

PSLDoc: Protein subcellular localization prediction based on gapped-dipeptides and probabilistic latent semantic analysis.

Proteins. 2008 Aug;72(2):693-710. doi: 10.1002/prot.21944.

Incorporating functional inter-relationships into protein function prediction algorithms.

BMC Bioinformatics. 2009 May 12;10:142. doi: 10.1186/1471-2105-10-142.

SVM-Fold: a tool for discriminative multi-class protein fold and superfamily recognition.

BMC Bioinformatics. 2007 May 22;8 Suppl 4(Suppl 4):S2. doi: 10.1186/1471-2105-8-S4-S2.

Prediction of protein subcellular localization.

Proteins. 2006 Aug 15;64(3):643-51. doi: 10.1002/prot.21018.

Protein subcellular localization prediction for Gram-negative bacteria using amino acid subalphabets and a combination of multiple support vector machines.

BMC Bioinformatics. 2005 Jul 13;6:174. doi: 10.1186/1471-2105-6-174.

APIS: accurate prediction of hot spots in protein interfaces by combining protrusion index with solvent accessibility.

BMC Bioinformatics. 2010 Apr 8;11:174. doi: 10.1186/1471-2105-11-174.

Cited by

GODoc: high-throughput protein function prediction using novel k-nearest-neighbor and voting algorithms.

BMC Bioinformatics. 2020 Nov 18;21(Suppl 6):276. doi: 10.1186/s12859-020-03556-9.
