Dresch Jacqueline M, Zellers Rowan G, Bork Daniel K, Drewell Robert A
Department of Mathematics and Computer Science, Clark University, Worcester, MA, USA.
Computer Science Department, Harvey Mudd College, Claremont, CA, USA; Mathematics Department, Harvey Mudd College, Claremont, CA, USA.
Gene Regul Syst Bio. 2016 Jun 12;10:21-33. doi: 10.4137/GRSB.S38462. eCollection 2016.
A long-standing objective in modern biology is to characterize the molecular components that drive the development of an organism. At the heart of eukaryotic development lies gene regulation. At the molecular level, much of the research in this field has focused on the binding of transcription factors (TFs) to regulatory regions in the genome known as cis-regulatory modules (CRMs). However, relatively little is known about the sequence-specific binding preferences of many TFs, especially with respect to the possible interdependencies between the nucleotides that make up binding sites. A particular limitation of many existing algorithms that aim to predict binding site sequences is that they do not allow for dependencies between nonadjacent nucleotides. In this study, we use a recently developed computational algorithm, MARZ, to compare binding site sequences using 32 distinct models in a systematic and unbiased approach to explore nucleotide dependencies within binding sites for 15 distinct TFs known to be critical to Drosophila development. Our results indicate that many of these proteins have varying levels of nucleotide interdependencies within their DNA recognition sequences, and that, in some cases, models that account for these dependencies greatly outperform the traditional models used to predict binding sites. We also directly compare the ability of different models to identify the known KRUPPEL TF binding sites in CRMs and demonstrate that a more complex model that accounts for nucleotide interdependencies outperforms simpler models. This ability to identify TFs with critical nucleotide interdependencies in their binding sites will lead to a deeper understanding of how these molecular characteristics contribute to the architecture of CRMs and the precise regulation of transcription during organismal development.
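To make the distinction between traditional models and dependency-aware models concrete, the following is a minimal Python sketch. It is not the MARZ algorithm and does not reproduce any of the 32 models from the study. It only contrasts a position-independent scoring model with one that adds a joint term for a single pair of nonadjacent positions. The toy binding sites, the uniform background of 0.25, the pseudocount, and all function names are assumptions made for illustration.

```python
import math
from collections import Counter

BASES = "ACGT"

def position_probs(sites, pseudocount=1.0):
    """Per-position base probabilities, treating every position as independent."""
    total = len(sites) + 4 * pseudocount
    probs = []
    for i in range(len(sites[0])):
        counts = Counter(site[i] for site in sites)
        probs.append({b: (counts[b] + pseudocount) / total for b in BASES})
    return probs

def pair_probs(sites, i, j, pseudocount=1.0):
    """Joint probabilities of the bases observed together at positions i and j."""
    total = len(sites) + 16 * pseudocount
    counts = Counter((site[i], site[j]) for site in sites)
    return {(a, b): (counts[(a, b)] + pseudocount) / total
            for a in BASES for b in BASES}

def score_independent(seq, probs, background=0.25):
    """Log-odds score that sums independent per-position contributions."""
    return sum(math.log2(probs[k][b] / background) for k, b in enumerate(seq))

def score_with_pair(seq, probs, joint, i, j, background=0.25):
    """Log-odds score in which positions i and j are scored jointly."""
    score = sum(math.log2(probs[k][b] / background)
                for k, b in enumerate(seq) if k not in (i, j))
    score += math.log2(joint[(seq[i], seq[j])] / background ** 2)
    return score

# Hypothetical toy sites: the first and last positions covary
# (A pairs with T, T pairs with A). These are not real binding sites.
sites = ["ACGT"] * 4 + ["TCGA"] * 4
probs = position_probs(sites)
joint = pair_probs(sites, 0, 3)

for candidate in ("ACGT", "ACGA"):
    print(candidate,
          round(score_independent(candidate, probs), 3),
          round(score_with_pair(candidate, probs, joint, 0, 3), 3))
```

In this toy set, the independent model assigns the same score to both candidates, because A and T are equally frequent at the last position. The model with the joint term scores ACGT higher than ACGA, since ACGA combines bases that never occur together in the example sites.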