Kumar Manish, Gromiha Michael M, Raghava Gajendra P S
Bioinformatics Centre, Institute of Microbial Technology, Sector 39A, Chandigarh-160036, India.
BMC Bioinformatics. 2007 Nov 27;8:463. doi: 10.1186/1471-2105-8-463.
Identification of DNA-binding proteins is one of the major challenges in genome annotation, as these proteins play a crucial role in gene regulation. In this paper, we developed several support vector machine (SVM) models for predicting DNA-binding domains and proteins. All models were trained and tested on multiple datasets of non-redundant proteins.
SVM models were developed on DNAaset, which consists of 1153 DNA-binding proteins and an equal number of non-DNA-binding proteins, and achieved maximum accuracies of 72.42% and 71.59% using amino acid and dipeptide compositions, respectively. The accuracy of the SVM model improved from 72.42% to 74.22% when evolutionary information in the form of position-specific scoring matrix (PSSM) profiles was used as input instead of amino acid composition. In addition, SVM models were developed on DNAset, which consists of 146 DNA-binding and 250 non-binding chains/domains, and achieved maximum accuracies of 79.80% and 86.62% using amino acid composition and PSSM profiles, respectively. The SVM models developed in this study perform better than existing methods on a blind dataset.
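As an illustration of the two composition-based inputs mentioned above, the following is a minimal sketch of how amino acid composition (a 20-dimensional vector) and dipeptide composition (a 400-dimensional vector) can be computed from a sequence. The function names, the percentage scaling, and the handling of non-standard residues (they are skipped) are assumptions made for this example and are not taken from the paper; the resulting vectors would then be passed to an SVM package of choice.

```python
from itertools import product

# The 20 standard amino acids, in a fixed order that defines the feature layout.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return the percentage of each standard residue (20 values).

    Assumption for this sketch: non-standard characters (e.g. X, B, Z)
    are ignored and do not count towards the sequence length.
    """
    residues = [aa for aa in sequence.upper() if aa in AMINO_ACIDS]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    total = len(residues)
    return [100.0 * residues.count(aa) / total for aa in AMINO_ACIDS]


def dipeptide_composition(sequence):
    """Return the percentage of each adjacent residue pair (400 values).

    Assumption for this sketch: a pair is counted only if both of its
    residues are standard amino acids.
    """
    seq = sequence.upper()
    counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
    total = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:
            counts[pair] += 1
            total += 1
    if total == 0:
        raise ValueError("sequence contains no valid dipeptides")
    # dict preserves insertion order, so features follow AA, AC, ..., YY.
    return [100.0 * counts[pair] / total for pair in counts]


if __name__ == "__main__":
    example = "MKRAAKK"  # hypothetical toy sequence, not from any dataset
    print(len(amino_acid_composition(example)))  # number of AAC features
    print(len(dipeptide_composition(example)))   # number of dipeptide features
```

Fixing the residue order up front keeps every sequence mapped to the same feature positions, which a fixed-length input such as an SVM feature vector requires.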
A highly accurate method has been developed for predicting DNA-binding proteins using SVM and PSSM profiles. This is the first study in which evolutionary information in the form of PSSM profiles has been used successfully for predicting DNA-binding proteins. A web server, DNAbinder, has been developed for identifying DNA-binding proteins and domains from query amino acid sequences (http://www.imtech.res.in/raghava/dnabinder/).