IEEE/ACM Trans Comput Biol Bioinform. 2017 Nov-Dec;14(6):1389-1398. doi: 10.1109/TCBB.2016.2616469. Epub 2016 Oct 11.
Protein-DNA interactions are ubiquitous in a wide variety of biological processes. Accurately locating DNA-binding residues from protein sequence alone is an important but challenging task for protein function annotation and drug discovery, especially in the post-genomic era, in which large volumes of protein sequences have accumulated rapidly. In this study, we report a new predictor, named TargetDNA, for identifying protein-DNA binding residues from primary sequences. TargetDNA uses a protein's evolutionary information and its predicted solvent accessibility as two base features and employs a centered linear kernel alignment algorithm to learn the weights with which the two features are combined. Because DNA-binding residues are severely outnumbered by non-binding residues, random under-sampling is applied to the original dataset, and multiple initial predictors, each using an SVM as its classifier, are trained on the weighted combined feature. The final ensemble predictor is obtained by boosting these initial predictors. Experimental results demonstrate that TargetDNA achieves high prediction performance and outperforms many existing sequence-based protein-DNA binding residue predictors. The TargetDNA web server and datasets are freely available at http://csbio.njust.edu.cn/bioinf/TargetDNA/ for academic use.
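The feature-weighting and under-sampling steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the exact weight-learning procedure, so the sketch assumes the simplest variant, in which each feature's weight is proportional to the centered alignment between its linear kernel and the label kernel. The toy data and all function names (`alignment_weights`, `combine`, `undersample`) are hypothetical, and SVM training and boosting are omitted.

```python
import math
import random

def linear_kernel(X):
    """Gram matrix K[i][j] = <x_i, x_j> for the rows of X."""
    return [[sum(a * b for a, b in zip(xi, xj)) for xj in X] for xi in X]

def center(K):
    """Center a Gram matrix: subtract row and column means, add the grand mean."""
    n = len(K)
    row = [sum(r) / n for r in K]
    col = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    tot = sum(row) / n
    return [[K[i][j] - row[i] - col[j] + tot for j in range(n)] for i in range(n)]

def frob(A, B):
    """Frobenius inner product of two equally sized matrices."""
    return sum(a * b for ra, rb in zip(A, B) for a, b in zip(ra, rb))

def alignment(K, y):
    """Centered alignment between kernel K and the target kernel y y^T."""
    Kc = center(K)
    Yc = center([[yi * yj for yj in y] for yi in y])
    denom = math.sqrt(frob(Kc, Kc) * frob(Yc, Yc))
    return frob(Kc, Yc) / denom if denom > 0 else 0.0

def alignment_weights(feature_blocks, y):
    """Assumed heuristic: weights proportional to each block's alignment."""
    scores = [max(alignment(linear_kernel(X), y), 0.0) for X in feature_blocks]
    total = sum(scores)
    if total == 0:
        return [1.0 / len(scores)] * len(scores)
    return [s / total for s in scores]

def combine(feature_blocks, weights):
    """Concatenate blocks scaled by sqrt(weight); the linear kernel of the
    result equals the weighted sum of the per-block linear kernels."""
    return [[math.sqrt(w) * v for w, x in zip(weights, row) for v in x]
            for row in zip(*feature_blocks)]

def undersample(labels, rng):
    """Keep all binding residues (+1) and an equal-size random draw of the rest."""
    pos = [i for i, t in enumerate(labels) if t == 1]
    neg = [i for i, t in enumerate(labels) if t != 1]
    return sorted(pos + rng.sample(neg, min(len(pos), len(neg))))

# Toy data (made up): 6 residues, +1 = DNA-binding, -1 = non-binding.
y = [1, 1, -1, -1, -1, -1]
evo = [[1.0], [0.9], [-1.0], [-0.8], [-1.1], [-0.9]]  # stand-in for evolutionary features
acc = [[0.2], [-0.1], [0.1], [-0.2], [0.3], [-0.3]]   # stand-in for solvent accessibility

weights = alignment_weights([evo, acc], y)
combined = combine([evo, acc], weights)
subset = undersample(y, random.Random(0))
print("weights:", weights)
print("training subset indices:", subset)
```

In a full pipeline, `undersample` would be called several times with different seeds, one SVM would be trained on the `combined` rows selected by each subset, and the resulting predictors would then be ensembled.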