Solis Armando D, Rackovsky S
Department of Biomathematical Sciences, Mount Sinai Medical Center, New York, New York 10029, USA.
Proteins. 2002 Aug 15;48(3):463-86. doi: 10.1002/prot.10126.
We use basic ideas from information theory to extract the maximum amount of structural information available in protein sequence data. From a non-redundant set of protein X-ray structures, we construct local-sequence-dependent [phi,psi] distributions that summarize the influence of local sequence on backbone conformation. These distributions, approximations of actual backbone propensities in the folded protein, have the following properties: (1) They compensate for the problem of scarce data by an optimized combination of local-sequence-dependent and single-residue-specific distributions; (2) They use multi-residue information; (3) They exploit similarities in the local coding properties of amino acids by collapsing the amino acid alphabet to streamline local sequence description; (4) They are designed to contain the maximum amount of local structural information the data set allows. Our methodology is able to extract around 30 cnats of information from the protein data set out of a total of 387 cnats of initial uncertainty, or entropy, in a finely discretized [phi,psi] dihedral angle space (18 x 18 structural states), or about 7.8%. This was achieved at the hexamer length scale; shorter as well as longer fragments produce reduced information gains. The automatic clustering of amino acids into groups, a component of the optimization procedure, reveals patterns consistent with their local coding properties. While the overall information gain from local sequence is small, some local sequences have significantly narrower structural distributions than others. Distribution width ranges from at least 20% below the average overall entropy to at least 14% above it. This spread is an expression of the influence of local sequence on the conformational propensities of the backbone chain. The optimal ensemble of local-sequence-specific backbone distributions produced is useful as a guide to structural predictions from sequence, as well as a tool for further explorations of the nature of the local protein code.
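The quantities named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that "cnats" denotes centinats (hundredths of a nat), that the 18 x 18 grid uses uniform 20-degree bins over [-180, 180), and that the information gain is the background entropy minus the class-weighted mean entropy of the sequence-conditioned distributions. The function names (`bin_index`, `entropy_cnats`, `mix`, `information_gain_cnats`) and the fixed mixing weight are hypothetical; the abstract says only that the combination is optimized and does not give a formula for it.

```python
import numpy as np

N_BINS = 18  # assumed: 18 x 18 grid over [phi, psi], i.e. 20-degree bins


def bin_index(phi, psi, n_bins=N_BINS):
    """Map dihedral angles in degrees to a flat index on the n_bins x n_bins grid."""
    width = 360.0 / n_bins
    i = np.floor((np.asarray(phi) + 180.0) / width).astype(int) % n_bins
    j = np.floor((np.asarray(psi) + 180.0) / width).astype(int) % n_bins
    return i * n_bins + j


def entropy_cnats(p):
    """Shannon entropy of a discrete distribution, in centinats (100 x nats)."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]  # empty bins contribute nothing to the sum
    return float(-100.0 * np.sum(nz * np.log(nz)))


def mix(p_local, p_single, weight):
    """Convex combination of a local-sequence distribution and a single-residue one.

    The weight is a free parameter here (hypothetical); no fitting is attempted.
    """
    return weight * np.asarray(p_local, dtype=float) + (1.0 - weight) * np.asarray(
        p_single, dtype=float
    )


def information_gain_cnats(p_background, conditionals, class_probs):
    """Background entropy minus the class-weighted mean conditional entropy."""
    h_cond = sum(w * entropy_cnats(p) for w, p in zip(class_probs, conditionals))
    return entropy_cnats(p_background) - h_cond


# Upper bound on entropy for 324 states: a uniform distribution.
uniform = np.full(N_BINS * N_BINS, 1.0 / (N_BINS * N_BINS))
print(round(entropy_cnats(uniform), 2))

# Fraction of the initial uncertainty recovered, using the abstract's figures.
print(round(100.0 * 30.0 / 387.0, 2))
```

Under these assumptions, the uniform distribution over 324 states gives 100 ln 324, about 578 cnats, so the reported 387 cnats of initial entropy already reflects the non-uniform occupancy of [phi,psi] space. A gain of 30 cnats out of 387 is 7.75%, which matches the "about 7.8%" quoted above.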