Benros Cristina, de Brevern Alexandre G, Etchebest Catherine, Hazout Serge
Equipe de Bioinformatique Génomique et Moléculaire, INSERM U726, Université Denis DIDEROT-Paris 7, Paris, France.
Proteins. 2006 Mar 1;62(4):865-80. doi: 10.1002/prot.20815.
We developed a novel approach for predicting local protein structure from sequence. It relies on the Hybrid Protein Model (HPM), an unsupervised clustering method we previously developed. This model learns three-dimensional protein fragments encoded into a structural alphabet of 16 protein blocks (PBs). Here, we focused on 11-residue fragments encoded as a series of seven PBs and used HPM to cluster them according to their local similarities. We thus built a library of 120 overlapping prototypes (mean fragments from each cluster), with good three-dimensional local approximation, i.e., a mean accuracy of 1.61 Å Cα root-mean-square distance. Our prediction method is intended to optimize the exploitation of the sequence-structure relations deduced from this library of long protein fragments. This was achieved by setting up a system of 120 experts, each defined by logistic regression to optimize the discrimination from sequence of a given prototype relative to the others. For a target sequence window, the experts computed probabilities of sequence-structure compatibility for the prototypes and ranked them, proposing the top scorers as structural candidates. Predictions were defined as successful when a prototype <2.5 Å from the true local structure was found among those proposed. Our strategy yielded a prediction rate of 51.2% for an average of 4.2 candidates per sequence window. We also proposed a confidence index to estimate prediction quality. Our approach predicts from sequence alone and will thus provide valuable information for proteins without structural homologs. Candidates will also contribute to global structure prediction by fragment assembly.
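The expert system described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the one-hot encoding, the 11-residue sequence window, the fixed `top_n` cutoff, and the random placeholder weights are all assumptions made for illustration, and the function names are hypothetical. Only the 120 prototypes, the per-prototype logistic regression, the ranking step, and the 2.5 Å success threshold come from the abstract.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
N_PROTOTYPES = 120  # size of the prototype library (from the abstract)
WINDOW = 11         # assumed window length, matching the 11-residue fragments


def one_hot(window):
    """Encode a sequence window as a flat one-hot vector of length WINDOW * 20."""
    x = np.zeros((len(window), len(AMINO_ACIDS)))
    for i, aa in enumerate(window):
        x[i, AMINO_ACIDS.index(aa)] = 1.0
    return x.ravel()


def expert_probabilities(x, weights, biases):
    """One logistic-regression output per prototype: sigmoid(w_k . x + b_k)."""
    return 1.0 / (1.0 + np.exp(-(weights @ x + biases)))


def propose_candidates(probs, top_n=5):
    """Rank prototypes by probability and return the indices of the top scorers."""
    return np.argsort(probs)[::-1][:top_n]


def is_success(candidates, rmsd_to_truth, threshold=2.5):
    """Success if any proposed prototype is closer than the threshold (in Å)."""
    return bool(np.any(rmsd_to_truth[candidates] < threshold))


# Placeholder, untrained parameters: real experts would be fitted on the library.
rng = np.random.default_rng(0)
weights = rng.normal(size=(N_PROTOTYPES, WINDOW * len(AMINO_ACIDS)))
biases = rng.normal(size=N_PROTOTYPES)

probs = expert_probabilities(one_hot("ACDEFGHIKLM"), weights, biases)
candidates = propose_candidates(probs)
```

Here `candidates` holds the indices of the prototypes proposed for the window, and `is_success` takes a hypothetical array of Cα RMSD values between each prototype and the true local structure. A fixed `top_n` is used only for simplicity; the reported average of 4.2 candidates per window suggests the number proposed varies from window to window.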