Won Kyoung-Jae, Sandelin Albin, Marstrand Troels Torben, Krogh Anders
The Bioinformatics Centre, Department of Biology & Biotech Research and Innovation Centre, University of Copenhagen, Ole Maaloes Vej 5, 2200 Copenhagen N, Denmark.
Bioinformatics. 2008 Aug 1;24(15):1669-75. doi: 10.1093/bioinformatics/btn254. Epub 2008 Jun 5.
Describing and modeling biological features of eukaryotic promoters remains an important and challenging problem within computational biology. The promoters of higher eukaryotes in particular display a wide variation in regulatory features, which makes them difficult to model. Often several factors are involved in the regulation of a set of co-regulated genes. In such cases, promoters can be modeled with connected regulatory features, where the network of connections is characteristic of a particular mode of regulation.
With the goal of automatically deciphering such regulatory structures, we present a method that iteratively evolves an ensemble of regulatory grammars using a hidden Markov model (HMM) architecture composed of interconnected blocks representing transcription factor binding sites (TFBSs) and background regions of promoter sequences. The ensemble approach reduces the risk of overfitting and generally improves performance. We apply this method to identify TFBSs and to classify promoters preferentially expressed in macrophages, where it outperforms other methods owing to the increased predictive power given by the grammar.
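To make the block idea concrete, the following is a minimal sketch of an HMM with a single background state and a single TFBS block, decoded with the Viterbi algorithm. It is not the authors' software and omits both the grammar evolution and the ensemble. The function names, the entry probability `p_enter`, the uniform background, and the AP-1-like weight matrix are all illustrative assumptions.

```python
import math

NEG_INF = float("-inf")

# Hypothetical position weight matrix built around the consensus TGACTCA.
# The 0.85 / 0.05 probabilities are illustrative, not fitted to any data.
AP1_LIKE_PWM = [
    {b: (0.85 if b == c else 0.05) for b in "ACGT"} for c in "TGACTCA"
]


def build_block_hmm(pwm, p_enter=0.01):
    """Return (log_trans, log_emit) for one background state plus one motif block.

    State 0 is background; states 1..len(pwm) are consecutive motif positions.
    """
    width = len(pwm)
    n = width + 1
    log_trans = [[NEG_INF] * n for _ in range(n)]
    log_trans[0][0] = math.log(1.0 - p_enter)  # stay in background
    log_trans[0][1] = math.log(p_enter)        # enter the motif block
    for i in range(1, width):
        log_trans[i][i + 1] = 0.0              # march through the block
    log_trans[width][0] = 0.0                  # leave the block
    background = {b: 0.25 for b in "ACGT"}     # assumed uniform background
    log_emit = [
        {b: math.log(p) for b, p in row.items()} for row in [background] + pwm
    ]
    return log_trans, log_emit


def viterbi(seq, log_trans, log_emit):
    """Return the most probable state path, assuming a start in background."""
    n = len(log_trans)
    score = [NEG_INF] * n
    score[0] = log_emit[0][seq[0]]
    back = []  # back[t - 1][j] is the best predecessor of state j at position t
    for sym in seq[1:]:
        new_score = [NEG_INF] * n
        ptr = [0] * n
        for j in range(n):
            best_i = max(range(n), key=lambda i: score[i] + log_trans[i][j])
            best = score[best_i] + log_trans[best_i][j]
            if best > NEG_INF:
                new_score[j] = best + log_emit[j][sym]
            ptr[j] = best_i
        back.append(ptr)
        score = new_score
    state = max(range(n), key=lambda i: score[i])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path


def motif_hits(path, width):
    """Return 0-based start positions of completed passes through the block."""
    return [t - width + 1 for t, s in enumerate(path) if s == width]


if __name__ == "__main__":
    log_trans, log_emit = build_block_hmm(AP1_LIKE_PWM)
    path = viterbi("GGCATTGACTCAGGCTA", log_trans, log_emit)
    print(motif_hits(path, len(AP1_LIKE_PWM)))  # [5]
```

In a grammar of the kind described above, several such blocks would be connected through background states, and the transitions between blocks would encode which sites tend to co-occur and in what order. The sketch keeps one block only to show how the path labels each base as background or as a motif position.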
The software and the datasets are available from http://modem.ucsd.edu/won/eHMM.tar.gz