Meshkin Alireza, Ghafuri Hossein
Department of Computer Engineering, Payam Nour University of Damavand, Tehran, Iran.
EXCLI J. 2010 Feb 8;9:29-38. eCollection 2010.
Since the native structure of most proteins is believed to be defined by their sequences, data mining methods that extract hidden knowledge from protein sequences are indispensable. A major difficulty in mining bioinformatics data is the size of the datasets, which frequently contain large numbers of variables. In this study, a two-step procedure for predicting the relative solvent accessibility of proteins is presented. In the first, "feature selection" step, a small subset of evolutionary information is identified on the basis of selected physicochemical properties. In the second step, support vector regression is used for real-value prediction of protein solvent accessibility from these custom-selected features of evolutionary information. The experimental results show that the proposed method improves on average prediction accuracy and training time.
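The two-step procedure described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the selection criterion, the feature encoding, or the regression settings, so everything below is an assumption made for illustration: evolutionary information is taken to be a 20-column PSSM in the order `ARNDCQEGHILKMFPSTWYV`, the physicochemical property is the Kyte-Doolittle hydropathy scale, the hypothetical helper `select_columns` keeps the columns with the largest absolute hydropathy, and the regressor is a plain linear support vector regression fitted by subgradient descent on synthetic data. It is not the authors' implementation.

```python
import random

# Assumed PSSM column order; the abstract does not state the encoding.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# Kyte-Doolittle hydropathy values, used here only as an example property.
HYDROPATHY = dict(zip(AMINO_ACIDS, [
    1.8, -4.5, -3.5, -3.5, 2.5, -3.5, -3.5, -0.4, -3.2, 4.5,
    3.8, -3.9, 1.9, 2.8, -1.6, -0.8, -0.7, -0.9, -1.3, 4.2,
]))


def select_columns(k):
    """Step 1 (hypothetical criterion): indices of the k PSSM columns
    whose residues have the largest absolute hydropathy."""
    ranked = sorted(range(20), key=lambda i: -abs(HYDROPATHY[AMINO_ACIDS[i]]))
    return sorted(ranked[:k])


def reduce_features(pssm_rows, columns):
    """Keep only the selected PSSM columns for every residue."""
    return [[row[c] for c in columns] for row in pssm_rows]


def predict(w, b, x):
    """Linear prediction w.x + b."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def train_linear_svr(X, y, eps=0.05, lam=1e-4, lr=0.01, epochs=100):
    """Step 2: linear SVR trained by stochastic subgradient descent on the
    L2-regularised epsilon-insensitive loss max(0, |pred - y| - eps)."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            err = predict(w, b, xi) - yi
            # Subgradient of the loss: zero inside the epsilon tube.
            g = 0.0 if abs(err) <= eps else (1.0 if err > 0 else -1.0)
            w = [wj - lr * (lam * wj + g * xj) for wj, xj in zip(w, xi)]
            b -= lr * g
    return w, b


# Synthetic stand-in data (not real proteins): 100 residues with PSSM scores
# scaled to [0, 1], and a made-up relative solvent accessibility target.
rng = random.Random(0)
pssm = [[rng.random() for _ in range(20)] for _ in range(100)]
rsa = [0.5 + 0.3 * (row[1] - row[9]) for row in pssm]

columns = select_columns(4)           # step 1: pick a small feature subset
X = reduce_features(pssm, columns)
w, b = train_linear_svr(X, rsa)       # step 2: real-value regression
mae = sum(abs(predict(w, b, x) - t) for x, t in zip(X, rsa)) / len(X)
print("selected columns:", columns)
print("training mean absolute error:", round(mae, 3))
```

Shrinking the input from 20 columns to 4 before regression is what makes training cheaper in this sketch; a real pipeline would use measured accessibility values, a windowed PSSM encoding, and most likely a kernel SVR from an established library.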