Zhang Ziding, Lindstam Mats, Unge Johan, Peterson Carsten, Lu Guoguang
Department of Molecular Biophysics, Center for Chemistry and Chemical Engineering, Lund University, P.O. Box 124, SE-221 00 Lund, Sweden.
J Mol Biol. 2003 Sep 5;332(1):127-42. doi: 10.1016/s0022-2836(03)00858-1.
A novel method has been developed for obtaining the correct alignment of a query sequence against remotely homologous proteins by extracting structural information from profiles of multiple structure alignments. The procedure combines a systematic search algorithm with a group of score functions based on sequence and structural information. A limited number of top-scoring solutions (15,000) were selected as candidates for further examination. On a test set comprising 301 proteins from 75 protein families with sequence identity less than 30%, our method raised the proportion of proteins with a completely correct alignment as the first candidate to 39.8%, whereas the typical performance of existing sequence-based alignment methods was only between 16.1% and 22.7%. Furthermore, our approach provides multiple candidate alignments, which dramatically increases the chance of finding the correct one: completely correct alignments were found among the top-ranked 1000 candidates for 88.3% of the proteins. With the assistance of a sequence database, completely correct alignments were found among the top 1000 candidates for 94.3% of the proteins. From such a limited number of candidates, it should become possible to identify more correct alignments using a more time-consuming but more powerful method that draws on more detailed structural information, such as side-chain packing and energy minimization. The results indicate that the novel alignment strategy could help extend highly reliable methods for fold identification and homology modeling to the huge number of homologous proteins of low sequence similarity. Details of the methods, together with the results and implications for future development, are presented.
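The figures quoted above (first-candidate and top-1000 success rates) are proportions of test proteins whose completely correct alignment appears within the first N ranked candidates. The following is a minimal sketch of how such a proportion could be computed, not the authors' evaluation code. The function name `top_n_hit_rate`, the representation of an alignment as a gapped string, and the toy data are all assumptions made for illustration.

```python
from typing import Dict, Sequence


def top_n_hit_rate(
    ranked: Dict[str, Sequence[str]],
    correct: Dict[str, str],
    n: int,
) -> float:
    """Return the fraction of proteins whose correct alignment is among
    the first n candidates of their ranked candidate list.

    ranked  -- protein id -> candidate alignments, best score first
    correct -- protein id -> reference (completely correct) alignment
    """
    if not ranked:
        raise ValueError("no proteins to evaluate")
    if n < 1:
        raise ValueError("n must be at least 1")
    hits = sum(
        1 for protein_id, candidates in ranked.items()
        if correct[protein_id] in candidates[:n]
    )
    return hits / len(ranked)


# Hypothetical toy data: three proteins, alignments written as gapped strings.
toy_ranked = {
    "protA": ["AC-GT", "ACG-T", "A-CGT"],
    "protB": ["MK-LV", "M-KLV"],
    "protC": ["GH-IL", "G-HIL", "GHI-L"],
}
toy_correct = {
    "protA": "AC-GT",   # ranked first
    "protB": "M-KLV",   # ranked second
    "protC": "GHIL-",   # absent from the candidate list
}

first_candidate_rate = top_n_hit_rate(toy_ranked, toy_correct, 1)
top_two_rate = top_n_hit_rate(toy_ranked, toy_correct, 2)
```

Evaluating the same function at N = 1 and N = 1000 over a full test set would give the two kinds of percentage reported in the abstract. A protein whose correct alignment never enters the candidate list, like `protC` here, counts as a miss at every N.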