Prediction of protein secondary structure content for the twilight zone sequences.

Author information

Homaeian Leila, Kurgan Lukasz A, Ruan Jishou, Cios Krzysztof J, Chen Ke

Affiliation

Department of Electrical and Computer Engineering, University of Alberta, Edmonton, Alberta, Canada.

Publication information

Proteins. 2007 Nov 15;69(3):486-98. doi: 10.1002/prot.21527.

DOI:10.1002/prot.21527

PMID:17623861

Abstract

Protein secondary structure carries information about local structural arrangements, which include three major conformations: alpha-helices, beta-strands, and coils. The large majority of successful secondary structure prediction methods are based on multiple sequence alignment. However, multiple alignment fails to provide accurate results when a sequence comes from the twilight zone, that is, when it is characterized by low (<30%) homology. To address this, we propose a novel method, called PSSC-core, for predicting secondary structure content through a comprehensive sequence representation. The method uses a multiple linear regression model and introduces a comprehensive feature-based sequence representation to predict the amount of helices and strands for sequences from the twilight zone. The PSSC-core method was tested and compared with two other state-of-the-art prediction methods on a set of 2187 twilight zone sequences. The results indicate that our method provides better predictions for both helix and strand content. PSSC-core is shown to provide statistically significantly better results than the competing methods, reducing the prediction error by 5-7% for helix and 7-9% for strand content predictions. The proposed feature-based sequence representation uses a comprehensive set of physicochemical properties that are custom-designed for each of the helix and strand content predictions. It includes composition and composition moment vectors; frequencies of tetra-peptides associated with helical and strand conformations; various property-based groups such as exchange groups, chemical groups of the side chains, and the hydrophobic group; auto-correlations based on hydrophobicity; side-chain masses; hydropathy; and conformational patterns for beta-sheets. The PSSC-core method provides an alternative for predicting secondary structure content that can be used to validate and constrain the results of other structure prediction methods. It also provides useful insight into the design of successful protein sequence representations, which can be used in developing new methods for predicting different aspects of protein secondary structure.
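As a rough illustration of the kind of feature-based representation and linear model described in the abstract, the sketch below computes two of the named feature families (amino acid composition and a hydrophobicity-style auto-correlation) and applies a linear predictor. It is not the PSSC-core implementation: the function names, the use of the Kyte-Doolittle hydropathy scale, the auto-correlation formula (mean product of values `lag` residues apart), and the choice of three lags are all assumptions made for the example, and the weights would have to be fitted on real data.

```python
# Illustrative sketch only; names, scale, and formulas are assumptions,
# not the published PSSC-core feature definitions.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy index, used here as a stand-in property scale.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def composition(seq):
    """Return the fraction of each of the 20 standard amino acids in seq."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    return [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]


def autocorrelation(seq, lag, scale=KYTE_DOOLITTLE):
    """Mean product of property values for residues `lag` positions apart.

    Raises KeyError if seq contains a residue missing from `scale`.
    """
    values = [scale[aa] for aa in seq.upper()]
    n = len(values)
    if not 0 < lag < n:
        raise ValueError("lag must satisfy 0 < lag < len(seq)")
    return sum(values[i] * values[i + lag] for i in range(n - lag)) / (n - lag)


def features(seq, max_lag=3):
    """Concatenate the composition vector with auto-correlations for lags 1..max_lag."""
    return composition(seq) + [autocorrelation(seq, k) for k in range(1, max_lag + 1)]


def predict_content(feature_vector, weights, intercept):
    """Apply a linear model: intercept + sum of weight * feature."""
    if len(feature_vector) != len(weights):
        raise ValueError("feature_vector and weights must have equal length")
    return intercept + sum(w * x for w, x in zip(weights, feature_vector))
```

In this sketch a separate weight vector would be fitted for helix content and for strand content, which loosely mirrors the abstract's statement that the representation is custom-designed for each of the two predictions.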
