Kuznetsov Igor B, Gou Zhenkun, Li Run, Hwang Seungwoo
Gen*NY*sis Center for Excellence in Cancer Genomics, Department of Epidemiology and Biostatistics, University at Albany, Rensselaer, New York 12144, USA.
Proteins. 2006 Jul 1;64(1):19-27. doi: 10.1002/prot.20977.
Proteins that interact with DNA are involved in a number of fundamental biological activities such as DNA replication, transcription, and repair. Reliable identification of DNA-binding sites in DNA-binding proteins is important for functional annotation, site-directed mutagenesis, and modeling of protein-DNA interactions. We apply a Support Vector Machine (SVM), a supervised pattern recognition method, to predict DNA-binding sites in DNA-binding proteins using the following features: amino acid sequence, profile of evolutionary conservation of sequence positions, and low-resolution structural information. We use a rigorous statistical approach to study the performance of predictors that utilize different combinations of features and how this performance is affected by structural and sequence properties of proteins. Our results indicate that an SVM predictor based on a properly scaled profile of evolutionary conservation in the form of a position-specific scoring matrix (PSSM) significantly outperforms a PSSM-based neural network predictor. The highest accuracy is achieved by an SVM predictor that combines the profile of evolutionary conservation with low-resolution structural information. Our results also show that knowledge-based predictors of DNA-binding sites perform significantly better on proteins from the mainly-alpha structural class and that the performance of these predictors is significantly correlated with certain structural and sequence properties of proteins. These observations suggest that it may be possible to assign a reliability index to the overall accuracy of the prediction of DNA-binding sites in any given protein using its sequence and structural properties. A web-server implementation of the predictors is freely available online at http://lcg.rit.albany.edu/dp-bind/.
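As an illustration of what a "properly scaled" PSSM input to a per-residue classifier might look like, the following is a minimal sketch, not the authors' implementation. The abstract does not state the scaling function, the window width, or how chain termini are handled, so the logistic scaling, the `half_width=3` default, and the zero padding are all assumptions, and the function names are hypothetical. Each resulting vector would be one training example for an SVM, labeled as binding or non-binding.

```python
import math


def scale_pssm(pssm):
    """Squash raw PSSM scores into (0, 1) with a logistic function (assumed scaling)."""
    return [[1.0 / (1.0 + math.exp(-score)) for score in row] for row in pssm]


def window_features(scaled, center, half_width, n_cols):
    """Concatenate scaled PSSM rows in a window centered on one residue.

    Window positions that fall outside the chain are zero-padded (assumed convention).
    """
    features = []
    for pos in range(center - half_width, center + half_width + 1):
        if 0 <= pos < len(scaled):
            features.extend(scaled[pos])
        else:
            features.extend([0.0] * n_cols)
    return features


def encode_protein(pssm, half_width=3):
    """Return one feature vector per residue, ready to pass to a classifier."""
    if not pssm:
        return []
    scaled = scale_pssm(pssm)
    n_cols = len(pssm[0])
    return [window_features(scaled, i, half_width, n_cols) for i in range(len(scaled))]
```

With a 20-column PSSM and `half_width=3`, each residue is described by 7 × 20 = 140 values, all on a common scale, which matters for SVM kernels that are sensitive to feature magnitude.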