Kumar K Krishna, Pugalenthi Ganesan, Suganthan P N
Institute for Neuro- and Bioinformatics, University of Lubeck, Lubeck 23538, Germany.
J Biomol Struct Dyn. 2009 Jun;26(6):679-86. doi: 10.1080/07391102.2009.10507281.
DNA-binding proteins (DNABPs) are important for various cellular processes, such as transcriptional regulation, recombination, replication, repair, and DNA modification. Various bioinformatics and machine learning techniques have been applied to identify DNA-binding proteins from protein structure, but only a few methods are available for identifying them from protein sequence. In this work, we report a random forest method, DNA-Prot, that identifies DNA-binding proteins from protein sequence. Training was performed on a dataset containing 146 DNA-binding proteins and 250 non-DNA-binding proteins. The algorithm was tested on a dataset containing 92 DNA-binding proteins and 100 non-DNA-binding proteins. We obtained 80.31% accuracy in training and 84.37% accuracy in testing. Benchmarking on an independent dataset of 823 DNA-binding proteins and 823 non-DNA-binding proteins shows that our approach can distinguish DNA-binding proteins from non-DNA-binding proteins with more than 80% accuracy. We also compared our method with the DNAbinder method on the test dataset and two independent datasets. The two methods performed comparably on the test dataset. On the benchmark dataset containing 823 DNA-binding proteins and 823 non-DNA-binding proteins, DNA-Prot performed significantly better, with 81.83% accuracy, whereas DNAbinder achieved only 61.42% accuracy using amino acid composition and 63.5% using PSSM profiles. Similarly, DNA-Prot performed better on the benchmark dataset containing 88 DNA-binding proteins and 233 non-DNA-binding proteins. These results show that DNA-Prot can be used effectively to identify DNA-binding proteins from sequence information. The dataset and a standalone version of the DNA-Prot software can be obtained from http://www3.ntu.edu.sg/home/EPNSugan/index_files/dnaprot.htm.
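As a rough illustration of the sequence-based features and the accuracy figures mentioned above, the following minimal Python sketch computes an amino acid composition vector (the feature type the abstract attributes to one DNAbinder variant) and an accuracy percentage from counts. The abstract does not state which features DNA-Prot itself uses, so nothing here should be read as the authors' implementation. The function names, the toy sequence, and the correct-prediction count are all hypothetical.

```python
from collections import Counter
from typing import List

# The 20 standard amino acids, in a fixed order so every feature vector
# has the same layout.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence: str) -> List[float]:
    """Return the fraction of each standard amino acid in a protein sequence.

    Non-standard symbols (for example 'X') are ignored. This is an
    illustrative feature only, not the documented DNA-Prot feature set.
    """
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


def accuracy(n_correct: int, n_total: int) -> float:
    """Accuracy as a percentage: correctly classified / all classified."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return 100.0 * n_correct / n_total


# Hypothetical toy sequence, used only to show the shape of the output.
features = amino_acid_composition("MKRAAKKR")

# The test set described above has 92 + 100 = 192 proteins. The count of
# 162 correct predictions is an assumption chosen for illustration; the
# abstract reports only percentages, not counts.
example_accuracy = accuracy(162, 92 + 100)
```

A 20-dimensional vector like `features` could be passed to any standard classifier, including a random forest. Whether DNA-Prot uses this representation is not stated in the abstract.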