Department of Biological Science and Technology, National Chiao Tung University, Hsinchu, Taiwan.
BMC Bioinformatics. 2011 Feb 15;12 Suppl 1(Suppl 1):S47. doi: 10.1186/1471-2105-12-S1-S47.
Existing methods for predicting DNA-binding proteins use features derived from physicochemical properties to design support vector machine (SVM) based classifiers. In general, the selection of physicochemical properties and the determination of their corresponding feature vectors rely mainly on known properties of the binding mechanism and on the designers' experience. This poses a problem for designers: some distinct physicochemical properties have similar vectors representing the 20 amino acids, while some closely related physicochemical properties have dissimilar vectors.
This study proposes a systematic approach, named Auto-IDPCPs, that automatically identifies a set of physicochemical and biochemical properties in the AAindex database for designing SVM-based classifiers to predict and analyze DNA-binding domains/proteins. Auto-IDPCPs consists of three steps: (1) clustering the 531 amino acid indices in AAindex into 20 clusters using a fuzzy c-means algorithm, (2) using IBCGA, an efficient optimization method based on a genetic algorithm, to select an informative feature set of size m to represent sequences, and (3) analyzing the selected features to identify related physicochemical properties that may affect the binding mechanism of DNA-binding domains/proteins. For predicting DNA-binding domains, Auto-IDPCPs identified m = 22 features of properties belonging to five clusters and achieved a five-fold cross-validation accuracy of 87.12%, slightly higher than the 86.62% of the existing method PSSM-400. For predicting DNA-binding sequences, Auto-IDPCPs achieved an accuracy of 75.50% using m = 28 features, compared with 74.22% for PSSM-400. On an independent test data set of DNA-binding domains, Auto-IDPCPs and PSSM-400 achieved accuracies of 80.73% and 82.81%, respectively. Typical physicochemical properties identified include hydrophobicity, secondary structure, charge, solvent accessibility, polarity, flexibility, normalized van der Waals volume, and pK (pK-C, pK-N, pK-COOH and pK-a(RCOOH)).
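The clustering step can be illustrated with a minimal fuzzy c-means sketch in plain Python. It is not the authors' implementation: the abstract does not give the fuzzifier value, the distance measure, or the initialization, so the function name `fuzzy_c_means`, the `fuzzifier` parameter (default 2.0), the Euclidean distance, and the toy vectors below are all assumptions made for illustration. In the setting described above, the input would instead be 531 vectors of length 20 (one value per amino acid) with c = 20.

```python
import math
import random


def fuzzy_c_means(points, c, fuzzifier=2.0, iters=100, tol=1e-6, seed=0):
    """Cluster equal-length float vectors into c fuzzy clusters.

    Returns (centers, memberships); memberships[i][k] is the degree to which
    point i belongs to cluster k, and every row sums to 1.
    All defaults here are illustrative assumptions, not values from the paper.
    """
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])

    # Random initial memberships, normalised so that each row sums to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(c)]
        total = sum(row)
        u.append([v / total for v in row])

    centers = []
    for _ in range(iters):
        # Centers are means of the points weighted by membership**fuzzifier.
        centers = []
        for k in range(c):
            w = [u[i][k] ** fuzzifier for i in range(n)]
            total = sum(w)
            centers.append(
                [sum(w[i] * points[i][d] for i in range(n)) / total
                 for d in range(dim)]
            )

        # Memberships are proportional to distance**(-2 / (fuzzifier - 1)).
        new_u = []
        for p in points:
            dist = [math.dist(p, ctr) for ctr in centers]
            if 0.0 in dist:
                # The point sits exactly on a center: full membership there.
                hits = [1.0 if d == 0.0 else 0.0 for d in dist]
                total = sum(hits)
                new_u.append([h / total for h in hits])
                continue
            inv = [d ** (-2.0 / (fuzzifier - 1.0)) for d in dist]
            total = sum(inv)
            new_u.append([v / total for v in inv])

        shift = max(abs(new_u[i][k] - u[i][k])
                    for i in range(n) for k in range(c))
        u = new_u
        if shift < tol:
            break
    return centers, u


# Hypothetical toy "index vectors": two obvious groups of three.
toy = [
    [0.0, 0.1, 0.2], [0.2, 0.0, 0.1], [0.1, 0.2, 0.0],
    [5.0, 5.1, 5.2], [5.2, 5.0, 5.1], [5.1, 5.2, 5.0],
]
centers, memberships = fuzzy_c_means(toy, c=2)
# Hard assignment: the cluster with the highest membership for each vector.
labels = [row.index(max(row)) for row in memberships]
print(labels)
```

Unlike hard k-means, each vector keeps a graded membership in every cluster, so an index that sits between two groups of properties is not forced into one of them; taking the highest membership, as in `labels`, is one possible way to derive a hard partition when one is needed.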
The proposed approach, Auto-IDPCPs, can help designers investigate informative physicochemical and biochemical properties by considering prediction accuracy and analysis of the binding mechanism simultaneously. Auto-IDPCPs can also be applied to predict and analyze other protein functions from sequences.