Deriving non-homogeneous DNA Markov chain models by cluster analysis algorithm minimizing multiple alignment entropy.

Author information

Borodovsky M, Peresetsky A

Affiliation

School of Biology, Georgia Institute of Technology, Atlanta 30332-0230.

Publication information

Comput Chem. 1994 Sep;18(3):259-67. doi: 10.1016/0097-8485(94)85022-4.

PMID:7952897

Abstract

Non-homogeneous Markov chain models can represent biologically important regions of DNA sequences. The statistical pattern described by these models is usually weak and has been found primarily because of strong biological indications. This paper presents a general method for extracting such patterns. The algorithm combines cluster analysis, multiple alignment, and entropy minimization. The method was first tested on a set of DNA sequences produced by Markov chain generators. Artificial gene sequences that were initially placed at random along the multiple alignment panels became aligned according to the hidden triplet phase. The method was then applied to real protein-coding sequences; the resulting alignment clearly indicated the triplet phase and yielded the parameters of the optimal 3-periodic non-homogeneous Markov chain model. Such Markov models are already employed in the GeneMark gene prediction algorithm, which is used in genome sequencing projects. The algorithm can also handle sequences that exhibit different statistical patterns, such as Escherichia coli protein-coding sequences belonging to Class II and Class III. It accepts a random mix of sequences from different classes, separates them into two groups (clusters), aligns each cluster separately, and defines a non-homogeneous Markov chain model for each cluster.
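The idea of aligning sequences to a hidden triplet phase by minimizing alignment entropy can be illustrated with a small sketch. This is not the authors' implementation: it assumes a simplified zero-order model (per-phase nucleotide frequencies rather than Markov transition probabilities), uses a plain greedy coordinate descent over phase shifts, and omits the clustering step entirely. The names `alignment_entropy` and `align_phases` are hypothetical.

```python
import math
from collections import Counter


def alignment_entropy(seqs, phases, period=3):
    """Sum of Shannon entropies (in bits) of the nucleotide frequencies
    pooled in each of the `period` phase classes of the alignment."""
    counts = [Counter() for _ in range(period)]
    for seq, phase in zip(seqs, phases):
        for i, base in enumerate(seq):
            counts[(i + phase) % period][base] += 1
    total = 0.0
    for c in counts:
        n = sum(c.values())
        if n:  # skip empty classes to avoid division by zero
            total -= sum(k / n * math.log2(k / n) for k in c.values())
    return total


def align_phases(seqs, period=3, init=None, max_sweeps=20):
    """Greedy coordinate descent (an assumption of this sketch): shift one
    sequence at a time and keep a shift only if it strictly lowers the
    total alignment entropy."""
    phases = list(init) if init is not None else [0] * len(seqs)
    best = alignment_entropy(seqs, phases, period)
    for _ in range(max_sweeps):
        changed = False
        for i in range(len(seqs)):
            for p in range(period):
                trial = phases[:i] + [p] + phases[i + 1:]
                h = alignment_entropy(seqs, trial, period)
                if h < best - 1e-12:
                    phases, best, changed = trial, h, True
        if not changed:
            break
    return phases, best


# Toy input: the same 3-periodic pattern read from three different start offsets.
toy = ["ACG" * 10, "CGA" * 10, "GAC" * 10]
print(align_phases(toy))
```

On real coding sequences the periodic signal is much weaker than in this toy input, so a greedy search of this kind may stop in a local minimum; the sketch is meant only to show how an entropy score can drive the choice of phase shifts.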