Pair hidden Markov models on tree structures.

Author information

Sakakibara Yasubumi

Affiliation

Department of Biosciences and Informatics, Keio University, 3-14-1 Hiyoshi, Kohoku-ku, Yokohama, 223-8522, Japan.

Publication information

Bioinformatics. 2003;19 Suppl 1:i232-40. doi: 10.1093/bioinformatics/btg1032.

DOI:10.1093/bioinformatics/btg1032

PMID:12855464

Abstract

MOTIVATION

Computational identification of non-coding RNA regions in the genome leaves much scope for investigation and is essentially harder than gene finding for protein-coding regions. Since comparative sequence analysis is effective for non-coding RNA detection, efficient computational methods are needed for structural alignment of RNA sequences. Hidden Markov models (HMMs), meanwhile, have played an important role in modeling and analysing biological sequences. In particular, pair HMMs (PHMMs) have been examined extensively as mathematical models for alignment and gene finding.

RESULTS

We propose pair HMMs on tree structures (PHMMTSs), an extension of PHMMs defined on alignments of trees. PHMMTSs provide a unifying framework and an automata-theoretic model for alignments of trees, structural alignments and pair stochastic context-free grammars. By structural alignment, we mean a pairwise alignment that aligns an unfolded RNA sequence to an RNA sequence of known secondary structure. First, we extend the notion of PHMMs, defined on alignments of 'linear' sequences, to pair stochastic tree automata, called PHMMTSs, defined on alignments of 'trees'. PHMMTSs provide various types of tree alignment, such as affine-gap alignments of trees, as well as an automata-theoretic model for alignment of trees. Second, based on the observation that an RNA secondary structure can be represented by a tree, we apply PHMMTSs to the problem of structural alignment of RNAs. We modify PHMMTSs so that they take as input a pair consisting of a 'linear' sequence and a 'tree' representing an RNA secondary structure, and produce a structural alignment. Further, PHMMTSs whose input is a pair of linear sequences are mathematically equivalent to pair stochastic context-free grammars. We report computational experiments that show the effectiveness of our method for structural alignment, and discuss the complexity of PHMMTSs.
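The observation that an RNA secondary structure can be represented by a tree can be illustrated with a minimal Python sketch. It is not the paper's implementation. It assumes the structure is given in dot-bracket notation (an input format not mentioned in the abstract) and that it contains no pseudoknots. The names `Node`, `structure_to_tree` and `to_sexpr` are hypothetical, chosen only for this illustration. Each base pair becomes an internal node, and each unpaired base becomes a leaf.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Tree node: 'root', an unpaired base ('A'), or a base pair ('G-C')."""
    label: str
    children: List["Node"] = field(default_factory=list)


def structure_to_tree(seq: str, dotbracket: str) -> Node:
    """Build a tree from a sequence and its dot-bracket structure.

    Assumption (not from the paper): '(' and ')' mark paired bases,
    '.' marks an unpaired base, and the structure has no pseudoknots.
    """
    if len(seq) != len(dotbracket):
        raise ValueError("sequence and structure must have equal length")
    root = Node("root")
    stack = [root]  # path from the root to the currently open base pair
    for base, sym in zip(seq, dotbracket):
        if sym == "(":
            node = Node(base)  # label is completed when the pair closes
            stack[-1].children.append(node)
            stack.append(node)
        elif sym == ")":
            if len(stack) == 1:
                raise ValueError("unbalanced structure: unmatched ')'")
            node = stack.pop()
            node.label = f"{node.label}-{base}"
        elif sym == ".":
            stack[-1].children.append(Node(base))
        else:
            raise ValueError(f"unexpected structure symbol: {sym!r}")
    if len(stack) != 1:
        raise ValueError("unbalanced structure: unmatched '('")
    return root


def to_sexpr(node: Node) -> str:
    """Render the tree as a nested parenthesised string for inspection."""
    if not node.children:
        return node.label
    inner = " ".join(to_sexpr(child) for child in node.children)
    return f"({node.label} {inner})"


if __name__ == "__main__":
    # A small hairpin: two stacked G-C pairs enclosing a three-base loop.
    print(to_sexpr(structure_to_tree("GGAAACC", "((...))")))
```

In this representation a hairpin such as `GGAAACC` with structure `((...))` becomes two nested `G-C` nodes whose innermost node has three `A` leaves. A tree of this kind is the sort of 'tree' input that a structural alignment against a 'linear' sequence would consume.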
