Aydin Zafer, Altunbasak Yucel, Borodovsky Mark
School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA 30332-0250, USA.
BMC Bioinformatics. 2006 Mar 30;7:178. doi: 10.1186/1471-2105-7-178.
The accuracy of protein secondary structure prediction has been improving steadily towards the estimated theoretical limit of 88%. Prediction algorithms are of two types. Single-sequence algorithms assume that information about other (homologous) proteins is not available, while algorithms of the second type assume that such information is available and use it intensively. Single-sequence algorithms could make an important contribution to studies of proteins with no detected homologs; however, the accuracy of secondary structure prediction from a single sequence is not as high as when additional evolutionary information is present.
In this paper, we further refine and extend the hidden semi-Markov model (HSMM) initially considered in the BSPSS algorithm. We introduce an improved residue dependency model by considering the patterns of statistically significant amino acid correlation at structural segment borders. We also derive models that specialize in different sections of the dependency structure and incorporate them into the HSMM. In addition, we implement an iterative training method to refine the estimates of the HSMM parameters. The three-state-per-residue accuracy and other accuracy measures of the new method, IPSSP, are shown to be comparable to or better than those of BSPSS and of PSIPRED when tested under the single-sequence condition.
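For readers unfamiliar with the three-state-per-residue accuracy (commonly written Q3), the following is a minimal sketch of how such a measure is usually computed: the fraction of residues whose predicted state matches the observed state. The function name `q3_accuracy` and the state letters H (helix), E (strand), and L (loop) are illustrative assumptions, not taken from the IPSSP implementation.

```python
def q3_accuracy(predicted: str, observed: str) -> float:
    """Return the fraction of residues whose predicted state equals the observed one.

    Both arguments are strings over a three-letter state alphabet
    (assumed here: H = helix, E = strand, L = loop), one letter per residue.
    """
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("sequences must not be empty")
    # Count positions where the two state strings agree.
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)


# Hypothetical 10-residue example: 8 of 10 states agree.
observed = "HHHHLLEEEL"
predicted = "HHHLLLEEEE"
print(q3_accuracy(predicted, observed))  # 0.8
```

A per-protein average and a per-residue average over a whole dataset can differ, so any comparison between methods should state which of the two is being reported.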
We have shown that new dependency models and training methods bring further improvements to single-sequence protein secondary structure prediction. The results were obtained under cross-validation conditions using a dataset in which no pair of sequences has significant sequence similarity. As new sequences are added to the database, it is possible to augment the dependency structure and obtain even higher accuracy. Current and future advances should contribute to the improvement of function prediction for orphan proteins that are inscrutable to current similarity search methods.