A hidden Markov model derived structural alphabet for proteins.

Author information

Camproux A C, Gautier R, Tufféry P

Affiliation

Equipe de Bioinformatique Génomique et Moléculaire, INSERM E0436, Université Paris 7, case 7113, 2 place Jussieu, 75251 Paris, France.

Publication information

J Mol Biol. 2004 Jun 4;339(3):591-605. doi: 10.1016/j.jmb.2004.04.005.

DOI:10.1016/j.jmb.2004.04.005

PMID:15147844

Abstract

Understanding and predicting protein structures depends on the complexity and accuracy of the models used to represent them. We have set up a hidden Markov model that discretizes protein backbone conformation as a series of overlapping fragments (states), each four residues long. This approach simultaneously learns the geometry of the states and their connections. Using a statistical criterion, we obtain an optimal systematic decomposition of the conformational variability of the protein peptide chain into 27 states with strong connection logic. This result is stable over different protein sets. Our model fits well with previous knowledge of protein architecture and appears able to capture some subtle details of protein organisation, such as helix sub-level organisation schemes. Taking into account the dependence between the states yields a low-complexity description of local protein structure. On average, the model uses only 8.3 of the 27 states to describe each position of a protein structure. Although we use short fragments, learning on entire protein conformations captures the logic of the assembly on a larger scale. Using such a model, protein structures can be reconstructed with an average accuracy close to 1.1 Å root-mean-square deviation and a complexity of only 3. Finally, we also observe that sequence specificity increases with the number of states of the structural alphabet. Such models can constitute a very relevant approach to the analysis of protein architecture, in particular for protein structure prediction.
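
The idea of encoding a backbone as overlapping four-residue fragments, each assigned to a state whose choice depends on the neighbouring states, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the function names `overlapping_fragments` and `viterbi` are hypothetical, the example uses a two-state toy model rather than the 27 states reported in the abstract, and the per-fragment state likelihoods are invented numbers rather than values derived from real coordinates. Viterbi decoding is a standard way to find the most probable state path in a hidden Markov model; the abstract does not say which decoding procedure the authors use.

```python
import math

def overlapping_fragments(residues, length=4):
    """Return every window of `length` consecutive residues, stepping by one."""
    return [residues[i:i + length] for i in range(len(residues) - length + 1)]

def viterbi(log_start, log_trans, log_emit):
    """Most probable state path through a hidden Markov model.

    log_start[s]    : log-probability of starting in state s
    log_trans[r][s] : log-probability of moving from state r to state s
    log_emit[t][s]  : log-likelihood of fragment t under state s
    """
    n_states = len(log_start)
    score = [log_start[s] + log_emit[0][s] for s in range(n_states)]
    back = []  # back[t-1][s] = best predecessor of state s at fragment t
    for t in range(1, len(log_emit)):
        prev, score, pointers = score, [], []
        for s in range(n_states):
            best = max(range(n_states), key=lambda r: prev[r] + log_trans[r][s])
            pointers.append(best)
            score.append(prev[best] + log_trans[best][s] + log_emit[t][s])
        back.append(pointers)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path

def logs(rows):
    """Element-wise natural log of a list of probability rows."""
    return [[math.log(p) for p in row] for row in rows]

# Toy chain of 7 residues: 7 - 4 + 1 = 4 overlapping fragments.
residues = list("ABCDEFG")
fragments = overlapping_fragments(residues)

# Invented two-state model (assumption; the published alphabet has 27 states).
log_start = [math.log(0.5), math.log(0.5)]
log_trans = logs([[0.9, 0.1], [0.1, 0.9]])
# Invented likelihood of each of the 4 fragments under each state.
log_emit = logs([[0.9, 0.1], [0.6, 0.4], [0.4, 0.6], [0.1, 0.9]])

path = viterbi(log_start, log_trans, log_emit)
print(len(fragments), path)
```

In this toy setting the strong self-transition probabilities favour long runs of the same state, which loosely mirrors the abstract's point that accounting for the dependence between states lowers the effective number of states needed at each position.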
