Methods for optimizing the structure alphabet sequences of proteins.

Author information

Dong Qi-wen, Wang Xiao-long, Lin Lei

Affiliation

School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China.

Publication information

Comput Biol Med. 2007 Nov;37(11):1610-6. doi: 10.1016/j.compbiomed.2007.03.002. Epub 2007 May 10.

DOI:10.1016/j.compbiomed.2007.03.002

PMID:17493604

Abstract

Protein structure prediction based on fragment assembly has made great progress in recent years, and local protein structure prediction is receiving increased attention. One essential step in local protein structure prediction is compressing three-dimensional conformations into a one-dimensional series of letters from a structural alphabet. The traditional method assigns each structure fragment the structure alphabet letter with the best local structural similarity. However, such a locally optimal structure alphabet sequence is not guaranteed to produce the globally optimal structure. This study presents two efficient methods that try to find the optimal structure alphabet sequence, one that models the native structure as accurately as possible. First, a 28-letter structure alphabet is derived by clustering fragments in Cartesian space with a fragment length of seven residues. The average quantization error of the 28 letters is 0.82 Å in terms of root mean square deviation (RMSD). Then, two efficient methods are presented to encode protein structures into series of structure alphabet letters: a greedy algorithm and a dynamic programming algorithm. They are tested on the PDB database using a structure alphabet developed in Cartesian coordinate space (our structure alphabet) and one developed in torsion angle space (the PB structure alphabet), respectively. The experimental results show that both methods can find approximately optimal structure alphabet sequences by searching a small fraction of the modeling space. The traditional local-optimization method achieves an RMSD of 26.27 Å between the reconstructed structures and the native ones, while the greedy algorithm improves the modeling accuracy to 3.28 Å. The results are helpful for local protein structure prediction.
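The gap between locally and globally good encodings can be illustrated with a toy model. The sketch below is not the authors' algorithm: it assumes a 2D chain with unit-length steps, treats each "letter" as a turning angle (loosely analogous to a torsion-angle alphabet), and compares coordinates without superposition. All function names and the two-letter alphabet are hypothetical and chosen only for illustration.

```python
import math


def build_chain(turns, step=1.0):
    """Rebuild 2D coordinates from a sequence of turning angles (radians)."""
    x, y, heading = 0.0, 0.0, 0.0
    points = [(x, y)]
    for turn in turns:
        heading += turn
        x += step * math.cos(heading)
        y += step * math.sin(heading)
        points.append((x, y))
    return points


def rmsd(a, b):
    """Coordinate RMSD between two equal-length chains (no superposition)."""
    total = sum((ax - bx) ** 2 + (ay - by) ** 2
                for (ax, ay), (bx, by) in zip(a, b))
    return math.sqrt(total / len(a))


def encode_local(native_turns, alphabet):
    """Baseline: at each position, pick the letter closest to the local turn."""
    return [min(alphabet, key=lambda a: abs(a - t)) for t in native_turns]


def encode_greedy(native_turns, alphabet, step=1.0):
    """Greedy: pick the letter whose rebuilt next point lies closest to the
    native next point, so earlier errors can be compensated later."""
    native = build_chain(native_turns, step)
    x, y, heading = 0.0, 0.0, 0.0
    chosen = []
    for i in range(len(native_turns)):
        tx, ty = native[i + 1]
        best = None
        for a in alphabet:
            h = heading + a
            nx = x + step * math.cos(h)
            ny = y + step * math.sin(h)
            d = (nx - tx) ** 2 + (ny - ty) ** 2
            if best is None or d < best[0]:
                best = (d, a, h, nx, ny)
        _, letter, heading, x, y = best
        chosen.append(letter)
    return chosen


# Toy input (assumed): a gentle arc encoded with a coarse two-letter alphabet.
native_turns = [0.1] * 40
alphabet = [0.0, 0.3]
native = build_chain(native_turns)
for name, encode in (("local", encode_local), ("greedy", encode_greedy)):
    model = build_chain(encode(native_turns, alphabet))
    print(name, round(rmsd(native, model), 2))
```

In this toy case the local encoder maps every turn to the straight letter, so its small per-position errors accumulate along the chain, while the greedy encoder mixes the two letters to stay near the native path. This mirrors, in miniature, the behavior the abstract reports for real structures.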
