Chen Ke, Jiang Yingfu, Du Li, Kurgan Lukasz
Department of Electrical and Computer Engineering, University of Alberta, Edmonton, Alberta, Canada.
J Comput Chem. 2009 Jan 15;30(1):163-72. doi: 10.1002/jcc.21053.
A computational model, IMP-TYPE, is proposed for classifying five types of integral membrane proteins from protein sequence. The model aims not only to provide accurate predictions but, more importantly, to incorporate interesting and transparent biological patterns. Compared with the best-performing existing models, IMP-TYPE reduces their error rates by 19% and 34% in two out-of-sample tests performed on benchmark datasets. Our empirical evaluations also show that the proposed method provides even larger improvements, namely error rate reductions of 29% and 45%, when predictions are made for sequences that share low (40%) identity with sequences in the training dataset. We also show that IMP-TYPE can be used in a standalone mode: it reproduces a significant majority of the correct predictions provided by other leading methods, while also correctly predicting sequences that those methods misclassify. Our method computes predictions using a Support Vector Machine classifier that takes a feature-based encoding of the sequence as its input. The input feature set includes hydrophobic amino acid (AA) pairs, which were selected using a consensus of three feature selection algorithms. Several recent studies of integral membrane proteins associate the hydrophobic residues that make up these AA pairs with the formation of transmembrane helices. Our study also indicates that Met and Phe display a certain degree of hydrophobicity, which may be more crucial than their polarity or aromaticity when they occur in transmembrane segments. This conclusion is supported by a recent study on the potential of mean force for membrane protein folding and by a study of scales for the membrane propensity of amino acids.
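The feature encoding described above can be illustrated with a minimal sketch. It is not the IMP-TYPE implementation: the residue set `HYDROPHOBIC`, the `max_gap` parameter, the normalization by sequence length, and the function names `pair_features` and `consensus` are all assumptions made for illustration, and the abstract does not specify how the pairs are counted or how the three feature selection results are combined. The sketch counts pairs of hydrophobic residues separated by a small number of positions and keeps only the features chosen by every selection method.

```python
from itertools import product

# Assumption: an illustrative hydrophobic residue set, not the set selected in the paper.
HYDROPHOBIC = "AFILMVW"


def pair_features(sequence, max_gap=2):
    """Count hydrophobic residue pairs separated by 0..max_gap residues.

    Returns a dict keyed by (first residue, second residue, gap), with counts
    divided by the sequence length so that long and short sequences are comparable.
    """
    seq = sequence.upper()
    keys = [(a, b, g)
            for a, b in product(HYDROPHOBIC, repeat=2)
            for g in range(max_gap + 1)]
    counts = dict.fromkeys(keys, 0)
    for gap in range(max_gap + 1):
        step = gap + 1  # distance between the two residues of a pair
        for i in range(len(seq) - step):
            a, b = seq[i], seq[i + step]
            if a in HYDROPHOBIC and b in HYDROPHOBIC:
                counts[(a, b, gap)] += 1
    length = max(len(seq), 1)  # avoid division by zero for an empty sequence
    return {key: value / length for key, value in counts.items()}


def consensus(*selected_sets):
    """Keep only the features chosen by every feature selection method (hypothetical rule)."""
    return set.intersection(*(set(s) for s in selected_sets))


features = pair_features("MFAAL", max_gap=1)
kept = consensus({("M", "F", 0), ("A", "L", 1)}, {("M", "F", 0)}, {("M", "F", 0), ("A", "A", 0)})
```

The values of the retained features, in a fixed order, would form the input vector for a classifier such as a Support Vector Machine; the classifier itself is omitted here to keep the sketch dependency-free.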