School of Information and Computer, Anhui Agricultural University, Hefei, 230036, Anhui, China.
Anhui Provincial Engineering Laboratory for Beidou Precision Agriculture Information, Anhui Agricultural University, Hefei, 230036, Anhui, China.
BMC Bioinformatics. 2023 Apr 4;24(1):129. doi: 10.1186/s12859-023-05263-7.
Identifying hot spots in protein-DNA binding interfaces is important for understanding the mechanisms underlying protein-DNA interactions and for drug design. Experimental methods for identifying hot spots are time-consuming and expensive, and most existing computational methods predict hot spots from traditional protein-DNA features, so they cannot make full use of the information those features contain.
In this work, we propose WTL-PDH, a method for hot spot prediction. To handle the imbalanced dataset, we used the Synthetic Minority Over-sampling Technique (SMOTE) to generate minority-class samples and balance the dataset. We first extracted solvent accessible surface area features and structural features, then processed these traditional features with the discrete wavelet transform and the wavelet packet transform to extract wavelet energy and wavelet entropy information, obtaining 175 features in total. To obtain the best feature subset, we systematically evaluated these features under various feature selection strategies. Finally, a light gradient boosting machine (LightGBM) was used to build the model.
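As an illustration of the wavelet step only, the sketch below computes per-band wavelet energy and a wavelet entropy from a one-dimensional feature sequence. It is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not name the wavelet family, the decomposition depth, or the entropy definition, so the Haar wavelet, `levels=2`, and Shannon entropy over relative band energies are all assumptions, as are the function names `haar_dwt` and `wavelet_energy_entropy`. Only the discrete wavelet transform is shown; a wavelet packet transform would also decompose the detail bands.

```python
import math


def haar_dwt(signal):
    """One level of the orthonormal Haar DWT.

    Returns (approximation, detail) coefficient lists.
    The Haar wavelet is an assumption; the abstract names no wavelet family.
    """
    values = list(signal)
    if len(values) % 2:
        values.append(values[-1])  # pad odd-length input by repeating the last value
    scale = math.sqrt(2.0)
    approx = [(values[i] + values[i + 1]) / scale for i in range(0, len(values), 2)]
    detail = [(values[i] - values[i + 1]) / scale for i in range(0, len(values), 2)]
    return approx, detail


def wavelet_energy_entropy(signal, levels=2):
    """Return (band_energies, entropy) for a multilevel Haar DWT.

    band_energies lists the energy of each detail band (finest first),
    followed by the energy of the final approximation band.
    entropy is the Shannon entropy of the relative band energies
    (an assumed definition of "wavelet entropy").
    """
    bands = []
    approx = list(signal)
    for _ in range(levels):
        approx, detail = haar_dwt(approx)
        bands.append(detail)
    bands.append(approx)

    energies = [sum(c * c for c in band) for band in bands]
    total = sum(energies)
    if total == 0:
        return energies, 0.0
    probs = [e / total for e in energies]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return energies, entropy


if __name__ == "__main__":
    # Hypothetical feature sequence, for illustration only.
    energies, entropy = wavelet_energy_entropy([1.0, 2.0, 3.0, 4.0], levels=2)
    print(energies, entropy)
```

Because the Haar transform used here is orthonormal, the band energies sum to the energy of the input when no padding is needed, which gives a quick sanity check on the decomposition.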
Our method achieved good results on an independent test set, with AUC, MCC and F1 scores of 0.838, 0.533 and 0.750, respectively. Compared with state-of-the-art methods, WTL-PDH generally achieves better performance in predicting hot spots. The dataset and source code are available at https://github.com/chase2555/WTL-PDH .
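For readers less familiar with two of the reported metrics, the following sketch shows how the F1 score and the Matthews correlation coefficient (MCC) follow from the counts of a binary confusion matrix. The function name `f1_and_mcc` and the counts in the usage example are hypothetical and are not taken from the paper; AUC is omitted because it requires ranked prediction scores rather than counts.

```python
import math


def f1_and_mcc(tp, fp, tn, fn):
    """Compute (F1, MCC) from binary confusion-matrix counts.

    tp/fp/tn/fn are true positives, false positives, true negatives
    and false negatives. Undefined ratios are returned as 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0

    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return f1, mcc


if __name__ == "__main__":
    # Hypothetical counts, for illustration only (not the paper's results).
    f1, mcc = f1_and_mcc(tp=8, fp=2, tn=8, fn=2)
    print(f"F1={f1:.3f}, MCC={mcc:.3f}")
```

MCC uses all four cells of the confusion matrix, so it tends to be more informative than F1 when the classes are imbalanced, as is typical for hot spot datasets.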