Prediction of catalytic residues using Support Vector Machine with selected protein sequence and structural properties.

Author information

Petrova Natalia V, Wu Cathy H

Affiliation

Protein Information Resource, Department of Biochemistry and Molecular & Cellular Biology, Georgetown University Medical Center, Washington, DC 20007, USA.

Publication information

BMC Bioinformatics. 2006 Jun 21;7:312. doi: 10.1186/1471-2105-7-312.

DOI:10.1186/1471-2105-7-312

PMID:16790052

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC1534064/

Abstract

BACKGROUND

The number of protein sequences deriving from genome sequencing projects is outpacing our knowledge about the function of these proteins. With the gap between experimentally characterized and uncharacterized proteins continuing to widen, it is necessary to develop new computational methods and tools for functional prediction. Knowledge of catalytic sites provides a valuable insight into protein function. Although many computational methods have been developed to predict catalytic residues and active sites, their accuracy remains low, with a significant number of false positives. In this paper, we present a novel method for the prediction of catalytic sites, using a carefully selected, supervised machine learning algorithm coupled with an optimal discriminative set of protein sequence conservation and structural properties.

RESULTS

To determine the best machine learning algorithm, 26 classifiers in the WEKA software package were compared using a benchmarking dataset of 79 enzymes with 254 catalytic residues in a 10-fold cross-validation analysis. Each residue of the dataset was represented by a set of 24 residue properties previously shown to be of functional relevance, as well as a label {+1/-1} to indicate catalytic/non-catalytic residue. The best-performing algorithm was the Sequential Minimal Optimization (SMO) algorithm, which is a Support Vector Machine (SVM). The Wrapper Subset Selection algorithm further selected seven of the 24 attributes as an optimal subset of residue properties, with sequence conservation, catalytic propensities of amino acids, and relative position on protein surface being the most important features.
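The wrapper-style selection described above can be illustrated with a minimal, standard-library-only sketch. It is not the authors' pipeline: the study used WEKA's SMO classifier and Wrapper Subset Selection, whereas this sketch assumes a nearest-centroid classifier as a stand-in for the SVM, a greedy forward search as one possible search strategy, and a small synthetic dataset in place of the 24 residue properties. All function names (`k_fold_indices`, `fit_centroids`, `predict`, `cv_accuracy`, `forward_select`) are hypothetical.

```python
import random
from statistics import mean


def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_centroids(rows, labels):
    """Stand-in classifier (NOT an SVM): one mean vector per class label."""
    centroids = {}
    for label in set(labels):
        members = [r for r, y in zip(rows, labels) if y == label]
        centroids[label] = [mean(col) for col in zip(*members)]
    return centroids


def predict(centroids, row):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    return min(
        centroids,
        key=lambda lab: sum((a - b) ** 2 for a, b in zip(row, centroids[lab])),
    )


def cv_accuracy(X, y, features, k=10):
    """Mean k-fold cross-validated accuracy using only the chosen feature columns."""
    Xs = [[row[f] for f in features] for row in X]
    scores = []
    for fold in k_fold_indices(len(Xs), k):
        held_out = set(fold)
        train = [i for i in range(len(Xs)) if i not in held_out]
        model = fit_centroids([Xs[i] for i in train], [y[i] for i in train])
        hits = sum(predict(model, Xs[i]) == y[i] for i in fold)
        scores.append(hits / len(fold))
    return mean(scores)


def forward_select(X, y, n_features, k=10):
    """Greedy wrapper search: keep adding the feature that most improves CV accuracy."""
    selected, best = [], 0.0
    while len(selected) < n_features:
        candidates = [f for f in range(n_features) if f not in selected]
        score, feature = max((cv_accuracy(X, y, selected + [f], k), f) for f in candidates)
        if score <= best:
            break  # no remaining feature helps; stop
        selected.append(feature)
        best = score
    return selected, best


# Synthetic data (an assumption for illustration): feature 0 tracks the
# +1/-1 label, while features 1 and 2 are pure noise.
rng = random.Random(1)
y = [+1 if i % 2 == 0 else -1 for i in range(100)]
X = [[2 * label + rng.gauss(0, 0.5), rng.gauss(0, 1), rng.gauss(0, 1)] for label in y]

selected, best = forward_select(X, y, n_features=3)
print("selected features:", selected, "cv accuracy:", round(best, 3))
```

The point of the wrapper approach is that each candidate subset is scored by the cross-validated performance of the classifier itself, so features are kept only when they help that particular classifier. On the synthetic data the search is expected to pick the informative feature first.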

CONCLUSION

The SMO algorithm with 7 selected attributes correctly predicted 228 of the 254 catalytic residues, with an overall predictive accuracy of more than 86%. Missing only 10.2% of the catalytic residues, the method captures the fundamental features of catalytic residues and can be used as a "catalytic residue filter" to facilitate experimental identification of catalytic residues for proteins with known structure but unknown function.
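The two catalytic-residue figures quoted above are consistent with each other, as a short calculation shows. The sketch below uses only the counts given in the abstract; the variable names are illustrative. The overall accuracy of more than 86% also depends on the non-catalytic residues, whose counts are not given here, so it is not recomputed.

```python
# Counts taken from the abstract.
total_catalytic = 254
correctly_predicted = 228

missed = total_catalytic - correctly_predicted
sensitivity = correctly_predicted / total_catalytic  # fraction of catalytic residues found
miss_rate = missed / total_catalytic                 # fraction of catalytic residues missed

print(f"missed residues: {missed}")        # 26
print(f"sensitivity: {sensitivity:.1%}")   # 89.8%
print(f"miss rate: {miss_rate:.1%}")       # 10.2%
```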

Similar articles

Identification of catalytic residues from protein structure using support vector machine with sequence and structural features.

Biochem Biophys Res Commun. 2008 Mar 14;367(3):630-4. doi: 10.1016/j.bbrc.2008.01.038. Epub 2008 Jan 17.

Accurate sequence-based prediction of catalytic residues.

Bioinformatics. 2008 Oct 15;24(20):2329-38. doi: 10.1093/bioinformatics/btn433. Epub 2008 Aug 18.

PREvaIL, an integrative approach for inferring catalytic residues using sequence, structural, and network features in a machine-learning framework.

J Theor Biol. 2018 Apr 14;443:125-137. doi: 10.1016/j.jtbi.2018.01.023. Epub 2018 Feb 1.

Automated method for predicting enzyme functional surfaces and locating key residues with accuracy and specificity.

Conf Proc IEEE Eng Med Biol Soc. 2006;2006:4552-5. doi: 10.1109/IEMBS.2006.259540.

Evaluation of features for catalytic residue prediction in novel folds.

Protein Sci. 2007 Feb;16(2):216-26. doi: 10.1110/ps.062523907. Epub 2006 Dec 22.

Enhanced performance in prediction of protein active sites with THEMATICS and support vector machines.

Protein Sci. 2008 Feb;17(2):333-41. doi: 10.1110/ps.073213608. Epub 2007 Dec 20.

HemeBIND: a novel method for heme binding residue prediction by combining structural and sequence information.

BMC Bioinformatics. 2011 May 26;12:207. doi: 10.1186/1471-2105-12-207.

Prediction of active sites of enzymes by maximum relevance minimum redundancy (mRMR) feature selection.

Mol Biosyst. 2013 Jan 27;9(1):61-9. doi: 10.1039/c2mb25327e. Epub 2012 Nov 2.

Glycosylation site prediction using ensembles of Support Vector Machine classifiers.

BMC Bioinformatics. 2007 Nov 9;8:438. doi: 10.1186/1471-2105-8-438.

Cited by

EzSEA: an interactive web interface for enzyme sequence evolution analysis.

Bioinform Adv. 2025 May 20;5(1):vbaf118. doi: 10.1093/bioadv/vbaf118. eCollection 2025.

SCREEN: A Graph-based Contrastive Learning Tool to Infer Catalytic Residues and Assess Enzyme Mutations.

Genomics Proteomics Bioinformatics. 2025 Jan 15;22(6). doi: 10.1093/gpbjnl/qzae094.

Accurately predicting enzyme functions through geometric graph learning on ESMFold-predicted structures.

Nat Commun. 2024 Sep 18;15(1):8180. doi: 10.1038/s41467-024-52533-w.

Multi-modal deep learning enables efficient and accurate annotation of enzymatic active sites.

Nat Commun. 2024 Aug 27;15(1):7348. doi: 10.1038/s41467-024-51511-6.

Enzyme function and evolution through the lens of bioinformatics.

Biochem J. 2023 Nov 29;480(22):1845-1863. doi: 10.1042/BCJ20220405.

RPpocket: An RNA-Protein Intuitive Database with RNA Pocket Topology Resources.

Int J Mol Sci. 2022 Jun 21;23(13):6903. doi: 10.3390/ijms23136903.

Machine learning for enzyme engineering, selection and design.

Protein Eng Des Sel. 2021 Feb 15;34. doi: 10.1093/protein/gzab019.

Computational Methods for Predicting Functions at the mRNA Isoform Level.

Int J Mol Sci. 2020 Aug 8;21(16):5686. doi: 10.3390/ijms21165686.

Intraoperative Margin Assessment in Oral and Oropharyngeal Cancer Using Label-Free Fluorescence Lifetime Imaging and Machine Learning.

IEEE Trans Biomed Eng. 2021 Mar;68(3):857-868. doi: 10.1109/TBME.2020.3010480. Epub 2021 Feb 18.

Coupling dynamics and evolutionary information with structure to identify protein regulatory and functional binding sites.

Proteins. 2019 Oct;87(10):850-868. doi: 10.1002/prot.25749. Epub 2019 Jun 22.
