Transcription binding site prediction using Markov models.

Author information

Abnizova Irina, Rust Alistair G, Robinson Mark, Te Boekhorst Rene, Gilks Walter R

Affiliation

BSU MRC Cambridge, CB2 2SR, UK.

Publication information

J Bioinform Comput Biol. 2006 Apr;4(2):425-41. doi: 10.1142/s0219720006001813.

PMID:16819793

Abstract

One of the main goals of analysing DNA sequences is to understand the temporal and positional information that specifies gene expression. An important step in this process is the recognition of gene expression regulatory elements. Experimental procedures for this are slow and costly. In this paper we present an unsupervised computational algorithm that facilitates the process by statistically identifying the most likely regions within a putative regulatory sequence. A probabilistic technique is presented, based on the approximation of regulatory DNA with a Markov chain, for locating putative transcription factor binding sites in a single stretch of DNA. To this end, we developed a procedure to approximate the order of the Markov model for a given DNA sequence that circumvents some of the prohibitive assumptions underlying Markov modeling. Application of the algorithm to data from 55 genes in five species shows the high sensitivity of this Markov search algorithm. Our algorithm does not require any prior knowledge in the form of description or cross-genomic comparison; it is context sensitive and takes DNA heterogeneity into account.
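
The abstract gives no implementation details, so the following is only a minimal sketch of the general idea of modelling a single DNA sequence with a Markov chain and scanning it for statistically unusual stretches. It is not the authors' algorithm. The function names (fit_markov, choose_order, least_expected_window), the additive smoothing, the use of BIC to pick the model order, and the "lowest log-probability window" score are all assumptions made for illustration; the paper describes its own order-approximation procedure, which is not reproduced here. Input is assumed to be an uppercase string over A, C, G, T.

```python
import math
from collections import Counter

ALPHABET = "ACGT"  # assumption: input contains only these uppercase symbols


def fit_markov(seq, order, pseudocount=1.0):
    """Return prob(context, base) estimated from k-mer counts in seq."""
    context_counts = Counter()
    transition_counts = Counter()
    for i in range(order, len(seq)):
        context = seq[i - order:i]
        context_counts[context] += 1
        transition_counts[context, seq[i]] += 1

    def prob(context, base):
        # Additive smoothing keeps unseen transitions at non-zero probability.
        numerator = transition_counts[context, base] + pseudocount
        denominator = context_counts[context] + pseudocount * len(ALPHABET)
        return numerator / denominator

    return prob


def position_log_probs(seq, order, start=None):
    """Log-probability of each base given its preceding `order` bases."""
    prob = fit_markov(seq, order)
    first = order if start is None else start
    return {i: math.log(prob(seq[i - order:i], seq[i]))
            for i in range(first, len(seq))}


def choose_order(seq, max_order=3):
    """Pick the order with the lowest BIC (a stand-in criterion, not the paper's)."""
    n = len(seq) - max_order  # score the same positions for every order
    best_order, best_bic = 0, math.inf
    for order in range(max_order + 1):
        log_lik = sum(position_log_probs(seq, order, start=max_order).values())
        n_params = (len(ALPHABET) - 1) * len(ALPHABET) ** order
        bic = -2.0 * log_lik + n_params * math.log(n)
        if bic < best_bic:
            best_order, best_bic = order, bic
    return best_order


def least_expected_window(seq, order, window):
    """Return (start, score) of the window the fitted chain finds least probable."""
    log_probs = position_log_probs(seq, order)
    best_start, best_score = None, math.inf
    for start in range(order, len(seq) - window + 1):
        score = sum(log_probs[j] for j in range(start, start + window))
        if score < best_score:
            best_start, best_score = start, score
    return best_start, best_score
```

A typical call would first run choose_order(seq) and then pass the result to least_expected_window(seq, order, window). Flagging the least probable window is only one possible scoring rule; a search for over-represented words would need a different statistic, and real sequences would also need handling for ambiguous bases such as N.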
