Prediction of catalytic residues using Support Vector Machine with selected protein sequence and structural properties.

Authors

Petrova Natalia V, Wu Cathy H

Affiliation

Protein Information Resource, Department of Biochemistry and Molecular & Cellular Biology, Georgetown University Medical Center, Washington, DC 20007, USA.

Publication

BMC Bioinformatics. 2006 Jun 21;7:312. doi: 10.1186/1471-2105-7-312.

Abstract

BACKGROUND

The number of protein sequences deriving from genome sequencing projects is outpacing our knowledge about the function of these proteins. With the gap between experimentally characterized and uncharacterized proteins continuing to widen, it is necessary to develop new computational methods and tools for functional prediction. Knowledge of catalytic sites provides a valuable insight into protein function. Although many computational methods have been developed to predict catalytic residues and active sites, their accuracy remains low, with a significant number of false positives. In this paper, we present a novel method for the prediction of catalytic sites, using a carefully selected, supervised machine learning algorithm coupled with an optimal discriminative set of protein sequence conservation and structural properties.

RESULTS

To determine the best machine learning algorithm, 26 classifiers in the WEKA software package were compared using a benchmarking dataset of 79 enzymes with 254 catalytic residues in a 10-fold cross-validation analysis. Each residue of the dataset was represented by a set of 24 residue properties previously shown to be of functional relevance, as well as a label {+1/-1} to indicate catalytic/non-catalytic residue. The best-performing algorithm was the Sequential Minimal Optimization (SMO) algorithm, which is a Support Vector Machine (SVM). The Wrapper Subset Selection algorithm further selected seven of the 24 attributes as an optimal subset of residue properties, with sequence conservation, catalytic propensities of amino acids, and relative position on protein surface being the most important features.
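The selection procedure described above (score a candidate feature subset by cross-validated accuracy and keep the subset that performs best) can be sketched roughly as follows. This is an illustration only, not the authors' WEKA pipeline: a nearest-centroid classifier stands in for SMO, a greedy forward search stands in for whatever search strategy the Wrapper Subset Selection run used (the abstract does not say), and the data are synthetic. All function and variable names are hypothetical.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit_centroids(X, y):
    """Stand-in classifier (not SMO): one mean vector per class label."""
    centroids = {}
    for label in set(y):
        rows = [r for r, lab in zip(X, y) if lab == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def predict(centroids, x):
    """Return the label whose centroid is closest to x."""
    def sq_dist(label):
        return sum((a - b) ** 2 for a, b in zip(x, centroids[label]))
    return min(centroids, key=sq_dist)

def cv_accuracy(X, y, features, k=10):
    """Mean k-fold accuracy using only the given feature columns."""
    Xs = [[row[f] for f in features] for row in X]
    scores = []
    for fold in k_fold_indices(len(X), k):
        held_out = set(fold)
        train = [i for i in range(len(X)) if i not in held_out]
        model = fit_centroids([Xs[i] for i in train], [y[i] for i in train])
        hits = sum(predict(model, Xs[i]) == y[i] for i in fold)
        scores.append(hits / len(fold))
    return sum(scores) / len(scores)

def forward_wrapper_selection(X, y, n_features, k=10):
    """Greedily add the feature that most improves CV accuracy."""
    chosen, best = [], 0.0
    while len(chosen) < n_features:
        candidates = [f for f in range(n_features) if f not in chosen]
        score, feat = max((cv_accuracy(X, y, chosen + [f], k), f)
                          for f in candidates)
        if score <= best:  # no candidate helps: stop
            break
        chosen.append(feat)
        best = score
    return chosen, best

# Synthetic stand-in data: 200 "residues", 5 attributes, labels in {+1, -1}.
# Attributes 0 and 2 carry signal; the rest are noise. None of this is the
# paper's benchmark of 79 enzymes and 24 residue properties.
rng = random.Random(42)
y = [+1 if i % 4 == 0 else -1 for i in range(200)]
X = [[rng.gauss(2.0 if (lab == +1 and f in (0, 2)) else 0.0, 1.0)
      for f in range(5)] for lab in y]

selected, acc = forward_wrapper_selection(X, y, n_features=5)
print("selected attributes:", selected)
print("10-fold CV accuracy:", round(acc, 3))
```

The same fold assignment is reused for every candidate subset, so differences in the score reflect the attributes rather than the split.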

CONCLUSION

The SMO algorithm with 7 selected attributes correctly predicted 228 of the 254 catalytic residues, with an overall predictive accuracy of more than 86%. Missing only 10.2% of the catalytic residues, the method captures the fundamental features of catalytic residues and can be used as a "catalytic residue filter" to facilitate experimental identification of catalytic residues for proteins with known structure but unknown function.
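The two catalytic-residue figures quoted above are consistent with each other, as a quick check shows. The sketch below uses only the counts stated in the abstract; the variable and function names are illustrative. The separate "more than 86%" overall accuracy cannot be recomputed here, because the abstract does not give the non-catalytic residue counts.

```python
# Counts reported in the abstract.
catalytic_total = 254   # catalytic residues in the benchmark dataset
catalytic_found = 228   # catalytic residues predicted correctly

def recall(true_positives: int, positives: int) -> float:
    """Fraction of actual positives that were predicted positive."""
    return true_positives / positives

sensitivity = recall(catalytic_found, catalytic_total)
miss_rate = 1.0 - sensitivity

print(f"recall    = {sensitivity:.1%}")   # 89.8%
print(f"miss rate = {miss_rate:.1%}")     # 10.2%
```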

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4436/1534064/07b7203ee576/1471-2105-7-312-1.jpg
