Key Laboratory of Intelligent Computing and Signal Processing of Ministry of Education, Institutes of Physical Science and Information Technology, Anhui University, Hefei, 230601, Anhui, China.
School of Computer Science and Technology, Anhui University, Hefei, 230601, Anhui, China.
Interdiscip Sci. 2021 Mar;13(1):1-11. doi: 10.1007/s12539-020-00399-z. Epub 2020 Oct 17.
Hot spot residues at protein-DNA binding interfaces are important for investigating the underlying mechanism of molecular recognition. Currently, only a few tools are available for identifying hot spot residues in protein-DNA complexes, and these tools require three-dimensional protein structures. However, three-dimensional structures are unavailable for most proteins. To address this limitation, we propose a method, named SPDH, that predicts hot spot residues from protein sequences alone. First, we extracted 133 features from physicochemical properties, conservation, and predicted solvent accessible surface area and structure. Then, we systematically assessed these features with various feature selection methods to obtain the optimal feature subset, and compared models built with four classical machine learning algorithms (support vector machine, random forest, logistic regression, and k-nearest neighbor) on the training dataset. We found that the variability of physicochemical property features between wild-type and mutant residues was important for improving the performance of the prediction model. On the independent test set, our method achieved an AUC of 0.760 and a sensitivity of 0.808, and outperformed other methods. The data and source code can be downloaded at https://github.com/xialab-ahu/SPDH .
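For readers unfamiliar with the two metrics reported above, the following minimal Python sketch shows how sensitivity and AUC can be computed from labels and prediction scores. It is not taken from the SPDH source code: the function names, the toy labels and scores, and the 0.5 decision threshold are all assumptions made for illustration.

```python
from typing import Sequence


def sensitivity(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of true hot spots (label 1) that are predicted as hot spots."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp + fn == 0:
        raise ValueError("y_true contains no positive examples")
    return tp / (tp + fn)


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a randomly chosen positive scores higher
    than a randomly chosen negative; tied pairs count as 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present to compute AUC")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical data, not from the paper):
# 1 = hot spot residue, 0 = non-hot spot residue.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1, 0.8]

# Assumed decision threshold of 0.5 for turning scores into class labels.
y_pred = [1 if s >= 0.5 else 0 for s in scores]

print("sensitivity:", sensitivity(y_true, y_pred))
print("AUC:", auc(y_true, scores))
```

Sensitivity depends on the chosen threshold, whereas AUC summarizes ranking quality across all thresholds, which is why the two are commonly reported together.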