A seqlet-based maximum entropy Markov approach for protein secondary structure prediction.

Author information

Dong Qiwen, Wang Xiaolong, Lin Lei, Guan Yi

Affiliation

School of Computer Science and Technology, Harbin Institute of Technology, China.

Publication information

Sci China C Life Sci. 2005 Aug;48(4):394-405. doi: 10.1360/062004-53.

PMID:16248433

Abstract

A novel method is presented for predicting the secondary structure of proteins from their amino acid sequences. Protein secondary structure seqlets, which are analogous to words in natural language, are extracted. These seqlets capture the relationship between amino acid sequence and secondary structure, and together they form a protein secondary structure dictionary. The dictionary is organism-specific. Protein secondary structure prediction is formulated as an integrated word segmentation and part-of-speech tagging problem. A word lattice represents the candidate word segmentations, and a maximum entropy model calculates the probability that a seqlet is tagged as a given secondary structure type. The method is Markovian in the seqlets, which permits efficient, exact calculation of the posterior probability distribution over all possible word segmentations and their tags by the Viterbi algorithm. The optimal segmentation and its tags are taken as the predicted secondary structure. The method is applied separately to proteins from four organisms and compared with the PHD method. The results show that it outperforms PHD by about 3.9% in Q3 accuracy and 4.6% in SOV accuracy. Combining it with locally similar protein sequences obtained by BLAST gives better predictions. The method is also tested on the 50 CASP5 target proteins, achieving a Q3 accuracy of 78.9% and an SOV accuracy of 77.1%. A web server for protein secondary structure prediction is available at http://www.insun.hit.edu.cn:81/demos/biology/index.html.
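The joint segmentation-and-tagging idea described above can be illustrated with a small sketch. This is not the authors' implementation: the maximum entropy model is replaced here by a lookup table of made-up log scores, the first-order tag transition table is an assumption, and every name (`viterbi_segment`, `emit`, `trans`, `max_len`) is hypothetical. The sketch only shows how a Viterbi search can pick the best path through a lattice of candidate seqlets.

```python
import math

def viterbi_segment(seq, emit, trans, max_len=4):
    """Return (log score, [(seqlet, tag), ...]) for the best joint
    segmentation and tagging of seq under the toy scores given."""
    n = len(seq)
    # best[i][tag] = (log score, back-pointer) for the prefix seq[:i]
    # whose last seqlet carries `tag`; None marks the empty prefix.
    best = [dict() for _ in range(n + 1)]
    best[0][None] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = seq[start:end]
            # Each dictionary seqlet ending at `end` is one lattice edge.
            for tag, log_p in emit.get(word, {}).items():
                for prev_tag, (prev_score, _) in best[start].items():
                    step = 0.0 if prev_tag is None else trans[(prev_tag, tag)]
                    score = prev_score + step + log_p
                    if tag not in best[end] or score > best[end][tag][0]:
                        best[end][tag] = (score, (start, prev_tag))
    if not best[n]:
        raise ValueError("no segmentation covers the sequence")
    # Pick the best final tag, then follow back-pointers to the start.
    tag = max(best[n], key=lambda t: best[n][t][0])
    score = best[n][tag][0]
    path, end = [], n
    while end > 0:
        _, (start, prev_tag) = best[end][tag]
        path.append((seq[start:end], tag))
        end, tag = start, prev_tag
    path.reverse()
    return score, path

# Toy dictionary: seqlet -> {tag: log score}. All values are invented.
emit = {
    "AEL": {"H": math.log(0.9)},
    "AE": {"H": math.log(0.5), "C": math.log(0.5)},
    "L": {"H": math.log(0.5), "E": math.log(0.5)},
    "LG": {"C": math.log(0.3)},
    "G": {"C": math.log(0.9), "H": math.log(0.1)},
}
# Assumed tag-to-tag transition scores (H = helix, E = strand, C = coil).
trans = {(a, b): math.log(0.6 if a == b else 0.2)
         for a in "HEC" for b in "HEC"}

score, path = viterbi_segment("AELG", emit, trans)
# Expand seqlet tags into one label per residue.
labels = "".join(tag * len(word) for word, tag in path)
print(path, labels)
```

Working in log space keeps the products of small probabilities numerically stable, and bounding seqlet length with `max_len` keeps the number of lattice edges linear in the sequence length.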
