Chang Darby Tien-Hao, Huang Hsuan-Yu, Syu Yu-Tang, Wu Chih-Peng
Department of Electrical Engineering, National Cheng Kung University, Tainan, 70101, Taiwan, R.O.C.
BMC Bioinformatics. 2008 Dec 12;9 Suppl 12(Suppl 12):S12. doi: 10.1186/1471-2105-9-S12-S12.
Prediction of protein solvent accessibility, also called accessible surface area (ASA) prediction, is an important step in predicting tertiary structure directly from one-dimensional sequences. Traditionally, predicting solvent accessibility is treated as either a two-state (exposed or buried) or three-state (exposed, intermediate or buried) classification problem. However, the states of solvent accessibility are not well defined in real protein structures. Thus, a number of methods have been developed to predict the real-valued ASA directly from evolutionary information such as the position-specific scoring matrix (PSSM).
This study enhances the PSSM-based features for real-valued ASA prediction by considering the physicochemical properties and solvent propensities of amino acid types. We propose a systematic method for identifying residue groups with respect to protein solvent accessibility. The amino acid columns in the PSSM profile that belong to a given residue group are merged to generate novel features. Finally, support vector regression (SVR) is adopted to construct a real-valued ASA predictor. Experimental results demonstrate that the features produced by the proposed selection process are informative for ASA prediction.
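The column-merging step described above can be illustrated with a minimal sketch. It assumes the PSSM is given as rows of 20 scores in the usual PSI-BLAST column order, and that "merging" means summing the member columns; the abstract does not state the merging operator. The groups in `EXAMPLE_GROUPS` and the function name `merge_group_columns` are hypothetical and are not the groups or code from the paper.

```python
# Column order of the 20 amino acids in a PSI-BLAST PSSM.
AA_ORDER = "ARNDCQEGHILKMFPSTWYV"

# Hypothetical residue groups, for illustration only. The paper selects its
# groups with its own procedure, which is not reproduced here.
EXAMPLE_GROUPS = {
    "hydrophobic": "AVLIMFWC",
    "charged": "DEKR",
}


def merge_group_columns(pssm, groups, aa_order=AA_ORDER):
    """Append one merged feature per residue group to every PSSM row.

    pssm   -- list of rows, one per residue, each holding 20 scores
    groups -- mapping of group name to a string of member amino acids

    Merging is done by summing the member columns (an assumption).
    """
    index = {aa: i for i, aa in enumerate(aa_order)}
    augmented = []
    for row in pssm:
        if len(row) != len(aa_order):
            raise ValueError("each PSSM row needs one score per amino acid")
        extra = [
            sum(row[index[aa]] for aa in members)
            for members in groups.values()
        ]
        # Keep the original 20 scores and add the group features after them.
        augmented.append(list(row) + extra)
    return augmented
```

The augmented rows, typically concatenated over a sliding window of neighbouring residues, could then be passed to any SVR implementation (for example `sklearn.svm.SVR`) with the real-valued ASA as the regression target.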
Experimental results on a widely used benchmark show that the proposed method performs best among several existing packages for ASA prediction. Furthermore, the feature selection mechanism used in this study can be applied to other regression problems that use the PSSM. The program and data are available from the authors upon request.