Churchill G A
Bull Math Biol. 1989;51(1):79-94. doi: 10.1007/BF02458837.
The composition of naturally occurring DNA sequences is often strikingly heterogeneous. In this paper, the DNA sequence is viewed as a stochastic process with local compositional properties determined by the states of a hidden Markov chain. The model used is a discrete-state, discrete-outcome version of a general model for non-stationary time series proposed by Kitagawa (1987). A smoothing algorithm is described which can be used to reconstruct the hidden process and produce graphic displays of the compositional structure of a sequence. The problem of parameter estimation is approached using likelihood methods and an EM algorithm for approximating the maximum likelihood estimate is derived. The methods are applied to sequences from yeast mitochondrial DNA, human and mouse mitochondrial DNAs, a human X chromosomal fragment and the complete genome of bacteriophage lambda.
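The smoothing step mentioned in the abstract can be illustrated with a standard scaled forward-backward pass over a discrete hidden Markov chain. This is a minimal sketch and not necessarily the exact recursion derived in the paper. The two-state setup (an AT-rich state and a GC-rich state), the function name `smooth`, and all parameter values below are assumptions chosen for illustration; none are taken from the article. Parameters are treated as known here, so the EM estimation step is not shown.

```python
def smooth(seq, init, trans, emit):
    """Return P(hidden state at t = k | entire sequence) for every position t.

    seq   : string of observed bases, e.g. "ACGT..."
    init  : list of initial state probabilities
    trans : trans[j][k] = P(next state k | current state j)
    emit  : emit[k][base] = P(base | state k)
    """
    n_states = len(init)
    n = len(seq)

    # Forward pass, rescaling each row to sum to 1 to avoid underflow.
    fwd, scale = [], []
    prev = None
    for t, base in enumerate(seq):
        if t == 0:
            row = [init[k] * emit[k][base] for k in range(n_states)]
        else:
            row = [
                emit[k][base] * sum(prev[j] * trans[j][k] for j in range(n_states))
                for k in range(n_states)
            ]
        c = sum(row)
        prev = [x / c for x in row]
        fwd.append(prev)
        scale.append(c)

    # Backward pass, using the same scaling constants as the forward pass.
    bwd = [[1.0] * n_states for _ in range(n)]
    for t in range(n - 2, -1, -1):
        nxt = seq[t + 1]
        bwd[t] = [
            sum(trans[k][j] * emit[j][nxt] * bwd[t + 1][j] for j in range(n_states))
            / scale[t + 1]
            for k in range(n_states)
        ]

    # Combine both passes into smoothed posteriors (renormalised against rounding).
    post = []
    for t in range(n):
        row = [fwd[t][k] * bwd[t][k] for k in range(n_states)]
        total = sum(row)
        post.append([x / total for x in row])
    return post


# Illustrative (assumed) parameters: state 0 is AT-rich, state 1 is GC-rich.
INIT = [0.5, 0.5]
TRANS = [[0.9, 0.1], [0.1, 0.9]]
EMIT = [
    {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
    {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
]

if __name__ == "__main__":
    for base, probs in zip("ATATATATGCGCGCGC", smooth("ATATATATGCGCGCGC", INIT, TRANS, EMIT)):
        print(base, [round(p, 3) for p in probs])
```

Plotting the posterior probability of one state against sequence position gives the kind of graphic display of compositional structure that the abstract describes. In practice the transition and emission probabilities would be estimated from the data, for example with EM, rather than fixed by hand as they are here.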