Gao Yu-Fei, Li Bi-Qing, Cai Yu-Dong, Feng Kai-Yan, Li Zhan-Dong, Jiang Yang
Department of Surgery, China-Japan Union Hospital of Jilin University, Changchun, People's Republic of China.
Mol Biosyst. 2013 Jan 27;9(1):61-9. doi: 10.1039/c2mb25327e. Epub 2012 Nov 2.
Identification of catalytic residues plays a key role in understanding how enzymes work. Although numerous computational methods have been developed to predict catalytic residues and active sites, prediction accuracy remains relatively low, with a high false-positive rate. In this work, we developed a novel predictor based on the Random Forest (RF) algorithm, aided by the maximum relevance minimum redundancy (mRMR) method and incremental feature selection (IFS). We incorporated features describing physicochemical/biochemical properties, sequence conservation, residue disorder, secondary structure and solvent accessibility to predict the active sites of enzymes, and achieved an overall accuracy of 0.885687 and a Matthews correlation coefficient (MCC) of 0.689226 on an independent test dataset. Feature analysis showed that every feature category except disorder contributed to the identification of active sites. Site-specific feature analysis further showed that features derived from the active site itself contributed most to active site determination. Our prediction method may become a useful tool for identifying active sites, and the key features identified in this work may provide valuable insights into the mechanism of catalysis.
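The abstract reports accuracy and MCC and describes choosing a feature subset by incremental feature selection over a ranked list. The following is a minimal sketch of those two ideas, not the authors' implementation. It assumes the feature ranking (for example from mRMR) is already available, and it takes the scoring step as a caller-supplied function: `score_subset` is a hypothetical callback that, in a setting like the one described, might train a Random Forest on the given features and return a validation MCC. The feature names and gains in the toy example are arbitrary placeholders, not results from the paper.

```python
import math
from typing import Callable, List, Sequence, Tuple


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: return 0.0 when the denominator is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


def incremental_feature_selection(
    ranked: Sequence[str],
    score_subset: Callable[[Sequence[str]], float],
) -> Tuple[List[str], List[float]]:
    """Score the top-1, top-2, ..., top-N prefixes of a ranked feature list.

    Returns the best-scoring prefix and the score of every prefix.
    `score_subset` is an assumed, caller-supplied evaluation function.
    """
    scores: List[float] = []
    best_k, best_score = 0, float("-inf")
    for k in range(1, len(ranked) + 1):
        s = score_subset(ranked[:k])
        scores.append(s)
        if s > best_score:  # keep the smallest prefix on ties
            best_k, best_score = k, s
    return list(ranked[:best_k]), scores


# Toy example with made-up per-feature gains (placeholders only).
toy_gain = {"f1": 0.40, "f2": 0.15, "f3": 0.05, "f4": -0.02}


def toy_score(subset: Sequence[str]) -> float:
    return sum(toy_gain[f] for f in subset)


best_features, curve = incremental_feature_selection(
    ["f1", "f2", "f3", "f4"], toy_score
)
```

In the toy run, the fourth feature lowers the score, so the selected subset is the first three ranked features. Plotting `curve` against the prefix size gives the kind of IFS curve usually inspected to pick the final feature set.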