Hu Jun, Bai Yan-Song, Zheng Lin-Lin, Jia Ning-Xin, Yu Dong-Jun, Zhang Gui-Jun
IEEE/ACM Trans Comput Biol Bioinform. 2022 Nov-Dec;19(6):3635-3645. doi: 10.1109/TCBB.2021.3123828. Epub 2022 Dec 8.
Protein-DNA interactions play an important role in diverse biological processes. Accurately identifying protein-DNA binding residues is a critical but challenging task for protein function annotation and drug design. Although wet-lab experimental methods are the most accurate way to identify protein-DNA binding residues, they are time-consuming and labor-intensive. There is an urgent need to develop computational methods that predict protein-DNA binding residues rapidly and accurately. In this study, we propose a novel sequence-based method, named PredDBR, for predicting DNA-binding residues. In PredDBR, for each query protein, its position-specific frequency matrix (PSFM), predicted secondary structure (PSS), and predicted probabilities of ligand-binding residues (PPLBR) are first generated as three feature sources. Second, for each feature source, the sliding window technique is employed to extract a matrix-format feature for each residue. Third, we design two strategies, square root (SR) and average (AVE), to transform the matrix-format features of each residue into three corresponding cube-format features: SR is applied to the PSFM-based feature, and AVE is applied to the features from the two predicted sources, PSS and PPLBR. Finally, after serially combining the three cube-format features, an ensemble classifier is generated by applying a bagging strategy to multiple base classifiers built with a 2D convolutional neural network framework. Computational experiments demonstrate that PredDBR achieves an average overall accuracy of 93.7% and a Matthews correlation coefficient of 0.405 on two independent validation datasets and outperforms several state-of-the-art sequence-based protein-DNA binding residue predictors. The PredDBR web server is available at https://jun-csbio.github.io/PredDBR/.
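The sliding window step can be illustrated with a minimal sketch. This is not the PredDBR implementation: the function name `sliding_window_features`, the window size of 3, the zero-padding at the sequence ends, and the toy profile values are all assumptions made for illustration, since the abstract does not specify them. The sketch only shows how a per-residue profile with L rows (such as a PSFM) could be turned into one matrix-format feature per residue.

```python
from typing import List

Matrix = List[List[float]]


def sliding_window_features(profile: Matrix, window: int = 3) -> List[Matrix]:
    """Return one (window x n_features) matrix per residue.

    profile: one row per residue, all rows having the same number of columns.
    Window positions that fall outside the sequence are filled with zeros
    (an assumed padding scheme, not taken from the paper).
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    if not profile:
        return []

    n_features = len(profile[0])
    half = window // 2
    features: List[Matrix] = []

    for center in range(len(profile)):
        rows: Matrix = []
        for pos in range(center - half, center + half + 1):
            if 0 <= pos < len(profile):
                rows.append(list(profile[pos]))   # copy the neighbor's row
            else:
                rows.append([0.0] * n_features)   # zero-pad beyond the ends
        features.append(rows)
    return features


# Toy profile: 4 residues x 2 feature columns (made-up values).
toy_profile = [[0.1, 0.9], [0.2, 0.8], [0.3, 0.7], [0.4, 0.6]]
windows = sliding_window_features(toy_profile, window=3)
```

Each entry of `windows` is centered on one residue, so the middle row of `windows[i]` is the profile row of residue `i` and the rows around it come from its sequence neighbors. How such matrices are then converted into cube-format features by the SR and AVE strategies is described in the paper itself and is not reproduced here.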