HMMs in Protein Fold Classification.

Author information

Lampros Christos, Papaloukas Costas, Exarchos Themis, Fotiadis Dimitrios I

Affiliations

Unit of Medical Technology and Intelligent Information Systems, Department of Materials Science and Engineering, University of Ioannina, University Campus of Ioannina, GR45110, Ioannina, Greece.

Department of Biological Applications and Technology, University of Ioannina, Ioannina, Greece.

Publication information

Methods Mol Biol. 2017;1552:13-27. doi: 10.1007/978-1-4939-6753-7_2.

DOI:10.1007/978-1-4939-6753-7_2

PMID:28224488

Abstract

Most HMMs are limited by their inherent high dimensionality. We therefore developed several variations of low-complexity models that can be applied even to protein families with only a few members, and we present these variations in this chapter. All of them use a hidden Markov model (HMM) with a small number of states, called a reduced state-space HMM. The model is trained with both the amino acid sequence and the secondary structure of proteins whose 3D structure is known, and it is then used for protein fold classification. We used data from the Protein Data Bank and annotation from the SCOP database to train and evaluate the proposed HMM variations on a number of protein folds that belong to major structural classes. The results indicate that the variations classify proteins about as well as, and in some cases better than, SAM, a widely used HMM-based method for protein classification. The major advantage of the proposed variations is that they employ a small number of states and that the training and scoring algorithms are of low complexity and thus relatively fast. The main variations examined are a reduced state-space HMM with seven states (7-HMM), a reduced state-space HMM with three states (3-HMM), and an optimized version of the three-state model in which an optimization process is applied to its scores (optimized 3-HMM).
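
The idea of a small-state HMM trained on sequence plus secondary structure can be illustrated with a minimal sketch. It is not the chapter's implementation, and several details are assumptions made for illustration only: the three states are taken to be helix (H), strand (E), and coil (C); training is done by counting labeled transitions and emissions with a pseudocount; scoring uses the standard forward algorithm; and the sequences, fold names, and function names (`train`, `score`, `classify`) are hypothetical toy examples, not data from the Protein Data Bank or SCOP.

```python
import math

# Assumption: the three hidden states are helix (H), strand (E), coil (C).
STATES = ["H", "E", "C"]
AMINO = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues; others raise KeyError


def _log_normalize(counts):
    """Turn a dict of positive counts into log-probabilities."""
    total = sum(counts.values())
    return {k: math.log(v / total) for k, v in counts.items()}


def train(examples, pseudo=1.0):
    """Estimate one fold model by counting, since the state path is known.

    examples: list of (amino_acid_sequence, secondary_structure_string),
    both non-empty and of equal length. `pseudo` is an assumed smoothing
    constant that keeps unseen events from getting zero probability.
    """
    start = {s: pseudo for s in STATES}
    trans = {s: {t: pseudo for t in STATES} for s in STATES}
    emit = {s: {a: pseudo for a in AMINO} for s in STATES}
    for seq, ss in examples:
        start[ss[0]] += 1
        for i, (residue, state) in enumerate(zip(seq, ss)):
            emit[state][residue] += 1
            if i > 0:
                trans[ss[i - 1]][state] += 1
    return {
        "start": _log_normalize(start),
        "trans": {s: _log_normalize(trans[s]) for s in STATES},
        "emit": {s: _log_normalize(emit[s]) for s in STATES},
    }


def _logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of log-values."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def score(model, seq):
    """Forward algorithm: log-likelihood of seq summed over all state paths.

    Only the amino acid sequence is needed at scoring time.
    """
    alpha = {s: model["start"][s] + model["emit"][s][seq[0]] for s in STATES}
    for residue in seq[1:]:
        alpha = {
            t: _logsumexp([alpha[s] + model["trans"][s][t] for s in STATES])
            + model["emit"][t][residue]
            for t in STATES
        }
    return _logsumexp(list(alpha.values()))


def classify(models, seq):
    """Assign seq to the fold whose model gives the highest log-likelihood."""
    return max(models, key=lambda fold: score(models[fold], seq))


# Hypothetical toy training data: (sequence, secondary structure) pairs.
models = {
    "alpha_like": train([
        ("AALEKLLEEAAK", "CHHHHHHHHHHC"),
        ("MKELLKKAEELG", "CHHHHHHHHHCC"),
    ]),
    "beta_like": train([
        ("GVTVYVKVTIEG", "CEEEEECEEEEC"),
        ("TKVYVIEVTYKG", "CEEEEEEEEEEC"),
    ]),
}

if __name__ == "__main__":
    for query in ("ALEKALEK", "VTVYVTIY"):
        print(query, classify(models, query))
```

With three states and twenty residue types, each fold model here has only a few dozen parameters, which is why such a model can in principle be estimated from a handful of family members. Both training and scoring run in time linear in the sequence length.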
