Bhardwaj Nitin, Lu Hui
Bioinformatics Program, Department of Bioengineering, University of Illinois at Chicago, Chicago, IL 60607, USA.
FEBS Lett. 2007 Mar 6;581(5):1058-66. doi: 10.1016/j.febslet.2007.01.086. Epub 2007 Feb 7.
Protein-DNA interactions are crucial to many cellular activities, such as expression control and DNA repair. These interactions between amino acids and nucleotides are highly specific, and any aberration at the binding site can completely abolish the interaction. In this study, we have three aims focusing on DNA-binding residues on the protein surface: to develop an automated approach for fast and reliable recognition of DNA-binding sites; to improve the prediction by distance-dependent refinement; and to use these predictions to identify DNA-binding proteins. We use a support vector machine (SVM)-based approach that exploits the features of DNA-binding residues to distinguish them from non-binding residues. The features include the residue's identity, charge, solvent accessibility, average potential, the secondary structure it is embedded in, neighboring residues, and location in a cationic patch. These features, collected from 50 proteins, are used to train the SVM. Testing is then performed on a separate set of 37 proteins, much larger than any testing set used in previous studies. The testing set has no more than 20% sequence identity, both among its own members and with the proteins in the training set, thus removing any undesired redundancy due to homology. This set also includes proteins from a DNA-binding structural class not present in the training set. With the above features, an accuracy of 66% with balanced sensitivity and specificity is achieved without relying on homology or evolutionary information. We then develop a post-processing scheme that improves the prediction using the relative location of the predicted residues. With this scheme, balanced performance is achieved, with average sensitivity, specificity and accuracy of 71.3%, 69.3% and 70.5%, respectively. The average net prediction is also around 70%. Finally, we show that the number of predicted DNA-binding residues can be used to differentiate DNA-binding proteins from non-DNA-binding proteins with an accuracy of 78%. The results presented here demonstrate that machine learning can be applied to the automated identification of DNA-binding residues and that the success rate can be improved as more features are added. Such functional site prediction protocols can be useful in guiding subsequent work such as site-directed mutagenesis and macromolecular docking.
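The evaluation measures quoted above can be illustrated with a minimal Python sketch. It is not the authors' code. It assumes the standard confusion-matrix definitions of sensitivity, specificity and accuracy, and it assumes that "net prediction" means the mean of sensitivity and specificity, which the abstract does not define. The names `residue_metrics` and `is_dna_binding_protein`, and the `threshold` parameter, are hypothetical; the abstract gives no cutoff value.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    sensitivity: float
    specificity: float
    accuracy: float
    net_prediction: float


def residue_metrics(tp: int, fp: int, tn: int, fn: int) -> Metrics:
    """Per-residue metrics from confusion-matrix counts.

    tp/fn count true DNA-binding residues predicted as binding/non-binding;
    tn/fp count true non-binding residues predicted as non-binding/binding.
    """
    sensitivity = tp / (tp + fn)  # fraction of binding residues recovered
    specificity = tn / (tn + fp)  # fraction of non-binding residues recovered
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    # Assumption: net prediction taken as the mean of sensitivity and specificity.
    net_prediction = (sensitivity + specificity) / 2
    return Metrics(sensitivity, specificity, accuracy, net_prediction)


def is_dna_binding_protein(n_predicted_binding: int, threshold: int) -> bool:
    """Call a protein DNA-binding when enough residues are predicted as binding.

    The threshold is a hypothetical parameter chosen by the user.
    """
    return n_predicted_binding >= threshold


if __name__ == "__main__":
    # Made-up counts for illustration only; these are not data from the study.
    print(residue_metrics(tp=30, fp=20, tn=80, fn=10))
    print(is_dna_binding_protein(n_predicted_binding=25, threshold=20))
```

Reporting sensitivity and specificity alongside accuracy matters here because binding residues are typically a minority of surface residues, so accuracy alone can look high for a predictor that rarely predicts binding.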