Hvidsten Torgeir R, Kryshtafovych Andriy, Fidelis Krzysztof
Linnaeus Centre for Bioinformatics, Uppsala University, Uppsala, Sweden.
Proteins. 2009 Jun;75(4):870-84. doi: 10.1002/prot.22296.
Local protein structure representations that incorporate long-range contacts between residues are often considered in protein structure comparison but have found relatively little use in structure prediction, where assembly from single backbone fragments dominates. Here, we introduce the concept of local descriptors of protein structure to characterize local neighborhoods of amino acids, including short- and long-range interactions. We build a library of recurring local descriptors and show that this library is general enough to allow assembly of unseen protein structures. The library could on average re-assemble 83% of 119 unseen structures, and showed little or no performance decrease between homologous targets and targets with folds not represented among the domains used to build it. We then systematically evaluate the descriptor library to establish the level of the sequence signal in sets of protein fragments of similar geometrical conformation. In particular, we test whether that signal is strong enough to facilitate correct assignment and alignment of these local geometries to new sequences. We use the signal to assign descriptors to a test set of 479 sequences with less than 40% sequence identity to any domain used to build the library, and show that on average more than 50% of the backbone fragments constituting descriptors can be correctly aligned. We also use the assigned descriptors to infer SCOP folds, and show that correct predictions can be made in many of the 151 cases where PSI-BLAST was unable to detect significant sequence similarity to proteins in the library. Although the combinatorial problem of simultaneously aligning several fragments to a sequence is a major bottleneck compared with single-fragment methods, the advantage of the current approach is that correct alignments imply correct long-range distance constraints. The lack of these constraints is most likely the major reason why structure prediction methods fail to consistently produce adequate models when good templates are unavailable or undetectable. Thus, we believe that the current study offers new and valuable insight into the prediction of sequence-structure relationships in proteins.
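To make the idea of a local descriptor concrete, the following is a minimal sketch of how a multi-fragment neighborhood around one residue might be extracted from C-alpha coordinates. It is an illustration only and not the authors' procedure: the function name `local_descriptor`, the 8.0 Å contact cutoff, the two-residue flank, and the use of C-alpha atoms alone are all assumptions, since the abstract does not state how descriptors are defined.

```python
from math import dist


def local_descriptor(ca_coords, center, cutoff=8.0, flank=2):
    """Return the backbone fragments around residue `center`.

    Fragments are inclusive (start, end) residue-index pairs.
    `cutoff` (in angstroms) and `flank` are illustrative assumptions,
    not values taken from the paper.
    """
    n = len(ca_coords)
    # Residues in spatial contact with the central residue, whether they
    # are close in sequence (short-range) or far apart (long-range).
    contacts = [i for i in range(n)
                if dist(ca_coords[i], ca_coords[center]) <= cutoff]
    # Extend every contact into a short backbone fragment.
    spans = sorted((max(0, i - flank), min(n - 1, i + flank))
                   for i in contacts)
    # Merge fragments that overlap or touch in sequence.
    merged = []
    for start, end in spans:
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Toy hairpin: two antiparallel five-residue strands, 5 angstroms apart.
hairpin = ([(i * 3.8, 0.0, 0.0) for i in range(5)]
           + [(i * 3.8, 5.0, 0.0) for i in range(4, -1, -1)])
fragments = local_descriptor(hairpin, center=0)
print(fragments)  # [(0, 4), (6, 9)]
```

In this toy case the descriptor for residue 0 contains two fragments that are distant in sequence but adjacent in space. Aligning both fragments to a new sequence at once would therefore fix a long-range distance constraint, which is the advantage over single-fragment methods that the abstract describes.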