Friedrich Torben, Koetschan Christian, Müller Tobias
University of Würzburg.
Stat Appl Genet Mol Biol. 2010;9:Article 6. doi: 10.2202/1544-6115.1480. Epub 2010 Jan 6.
Hidden Markov models (HMMs) play a major role in applications that unravel biomolecular function. Although HMMs are technically mature and widely applied in computational biology, there is potential for methodological optimisation in how they model biological data sources with varying sequence lengths. The individual building blocks of these models, the states, are each associated with a holding time, which links them to the length distribution of the sequence motifs they represent. Regular HMM topologies are adapted to bell-shaped sequence length distributions by chaining hidden states in series, which keeps the model within the class of conventional hidden Markov models. The number of state repetitions (r) and the state-specific duration parameter (p) are determined by fitting the distribution of sequence lengths with the method of moments (MM) and maximum likelihood (ML). Performance evaluations of differently adjusted HMM topologies underline the impact of optimising HMMs on the basis of sequence lengths. Secondary structure prediction on internal transcribed spacer 2 sequences serves as an example of the general impact of topological optimisation. In summary, we propose a general methodology to improve the modelling behaviour of HMMs by topological optimisation, using ML and a fast, easily implemented moment estimator.
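As an illustration of the moment-based fit, the following minimal sketch estimates r and p from a sample of sequence lengths. It rests on assumptions not stated in the abstract: p is taken to be the self-transition (stay) probability of each state, every state in the chain shares the same p, and each state emits at least one symbol, so the total length is a sum of r geometric holding times with mean r/(1-p) and variance rp/(1-p)^2. The paper's exact parameterisation may differ, and the function names are hypothetical.

```python
def chain_mean(r, p):
    """Mean length of r serial states with stay probability p (assumed model)."""
    return r / (1.0 - p)


def chain_variance(r, p):
    """Variance of the length under the same assumed model."""
    return r * p / (1.0 - p) ** 2


def fit_chain_by_moments(lengths):
    """Method-of-moments estimate of (r, p) from observed sequence lengths.

    Solving mean = r/(1-p) and var = r*p/(1-p)^2 gives
    r = mean^2 / (mean + var). Since r must be a positive integer,
    it is rounded, and p is then re-derived so that the model mean
    still matches the sample mean.
    """
    n = len(lengths)
    m = sum(lengths) / n                         # sample mean
    v = sum((x - m) ** 2 for x in lengths) / n   # population variance
    r_raw = m * m / (m + v)                      # real-valued estimate of r
    r = max(1, round(r_raw))                     # at least one state
    p = max(0.0, 1.0 - r / m)                    # clamp: p cannot be negative
    return r, p


if __name__ == "__main__":
    sample = [8, 10, 12, 10, 10, 9, 11]  # made-up lengths for illustration
    r, p = fit_chain_by_moments(sample)
    print(r, p, chain_mean(r, p), chain_variance(r, p))
```

Because r is rounded to an integer, the fitted variance generally no longer matches the sample variance exactly; only the mean is preserved in this sketch. A maximum likelihood fit would instead search over integer r and optimise p for each candidate.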