Huang Ruiyan, Qiu Wangren, Xiao Xuan, Lin Weizhong
School of Information Engineering, Jingdezhen Ceramic University, Jingdezhen, Jiangxi, China.
School of Information Engineering, Jiangxi Art & Ceramics Technology Institute, Jingdezhen, Jiangxi, China.
PLoS One. 2025 May 13;20(5):e0320817. doi: 10.1371/journal.pone.0320817. eCollection 2025.
Protein-DNA interactions play a crucial role in cellular biology and are essential for maintaining life processes and regulating cellular functions. We propose a method called iProtDNA-SMOTE, which combines non-equilibrium graph neural networks with pre-trained protein language models to predict DNA-binding residues. By working directly with unbalanced graph data, the approach addresses the class imbalance problem in protein-DNA binding site prediction and thereby improves the model's generalization and specificity. We trained the model on two datasets, TR646 and TR573, and conducted a series of experiments to evaluate its performance. The model achieved AUC values of 0.850, 0.896, and 0.858 on the independent test datasets TE46, TE129, and TE181, respectively. These results indicate that iProtDNA-SMOTE outperforms existing methods in accuracy and generalization when predicting DNA binding sites, and that it provides reliable predictions with fewer errors. For the convenience of the scientific community, the benchmark datasets and code are publicly available at https://github.com/primrosehry/iProtDNA-SMOTE.
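The abstract does not describe how the class imbalance is handled in detail. As an illustration only, the following is a minimal sketch of classic SMOTE-style oversampling, which the method's name suggests: synthetic minority samples (here, hypothetical feature vectors of DNA-binding residues) are created by interpolating between a minority point and one of its nearest minority neighbours. The function name `smote_oversample`, its parameters, and the toy feature values are assumptions made for this example. It is not the authors' implementation and does not reproduce their graph-based variant.

```python
import random


def smote_oversample(minority, n_new, k=3, seed=0):
    """Return n_new synthetic points interpolated between minority neighbours.

    Illustrative sketch of classic SMOTE on plain feature vectors; the
    graph-based procedure used by iProtDNA-SMOTE is not reproduced here.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        # Pick one of the k nearest neighbours and interpolate towards it.
        neighbour = rng.choice(others[:k])
        t = rng.random()
        synthetic.append([a + t * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Toy 2-D feature vectors standing in for binding-residue embeddings (made up).
binding = [[0.9, 0.8], [1.0, 0.7], [0.8, 0.9], [0.95, 0.85]]
new_points = smote_oversample(binding, n_new=6)
```

Because each synthetic point lies on a line segment between two existing minority samples, the new points stay inside the region already occupied by the minority class rather than being exact duplicates of it.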