Hidden Markov models that use predicted local structure for fold recognition: alphabets of backbone geometry.

Author information

Karchin Rachel, Cline Melissa, Mandel-Gutfreund Yael, Karplus Kevin

Affiliation

Center for Biomolecular Science and Engineering, Baskin School of Engineering, University of California, Santa Cruz 95064, USA.

Publication information

Proteins. 2003 Jun 1;51(4):504-14. doi: 10.1002/prot.10369.

DOI:10.1002/prot.10369

PMID:12784210

Abstract

An important problem in computational biology is predicting the structure of the large number of putative proteins discovered by genome sequencing projects. Fold-recognition methods attempt to solve the problem by relating the target proteins to known structures, searching for template proteins homologous to the target. Remote homologs that may have significant structural similarity are often not detectable by sequence similarities alone. To address this, we incorporated predicted local structure, a generalization of secondary structure, into two-track profile hidden Markov models (HMMs). We did not rely on a simple helix-strand-coil definition of secondary structure, but experimented with a variety of local structure descriptions, following a principled protocol to establish which descriptions are most useful for improving fold recognition and alignment quality. On a test set of 1298 nonhomologous proteins, HMMs incorporating a 3-letter STRIDE alphabet improved fold recognition accuracy by 15% over amino-acid-only HMMs and 23% over PSI-BLAST, measured by ROC-65 numbers. We compared two-track HMMs to amino-acid-only HMMs on a difficult alignment test set of 200 protein pairs (structurally similar with 3-24% sequence identity). HMMs with a 6-letter STRIDE secondary track improved alignment quality by 62%, relative to DALI structural alignments, while HMMs with an STR track (an expanded DSSP alphabet that subdivides strands into six states) improved by 40% relative to CE.
