Chen Gong, Zhou Qing
Department of Statistics, University of California, Los Angeles, Los Angeles, California 90095, USA.
Biometrics. 2010 Sep;66(3):694-704. doi: 10.1111/j.1541-0420.2009.01362.x.
Transcription factors bind sequence-specific sites in DNA to regulate gene transcription. Identifying transcription factor binding sites (TFBSs) is an important step toward understanding gene regulation. Computational methods for TFBS detection and motif finding model TFBSs and their combinatorial patterns in sophisticated ways, yet they often make oversimplified homogeneous model assumptions for background sequences. Because nucleotide base composition varies across genomic regions, incorporating this heterogeneity into background modeling is expected to help motif finding. When sequences from multiple species are used, variation in evolutionary conservation violates the common assumption of an identical conservation level across a multiple alignment. To handle both types of heterogeneity, we propose a generative model in which a segmented Markov chain partitions a multiple alignment into regions of homogeneous nucleotide base composition and a hidden Markov model (HMM) accounts for different conservation levels. Bayesian inference on the model is developed via Gibbs sampling with dynamic programming recursions. Simulation studies and empirical evidence from biological data sets reveal the dramatic effect of background modeling on motif finding and demonstrate that the proposed approach can achieve substantial improvements over commonly used background models.
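The abstract names Gibbs sampling with dynamic programming recursions but does not spell out the recursions. As a rough illustration only, the sketch below shows one standard way such a step can look: forward filtering followed by backward sampling of the hidden states of a small HMM, which draws the whole state path in one block given fixed parameters. It is not the authors' model. The two hidden states (low and high conservation), the binary observation per alignment column (1 if the column is identical across species, 0 otherwise), and every probability value are assumptions made for this example, and the function names `forward_filter` and `backward_sample` are hypothetical.

```python
import random


def forward_filter(obs, init, trans, emit):
    """Forward recursion with per-step normalization.

    Returns alphas, where alphas[t][k] = P(state_t = k | obs[0..t]).
    emit[k][x] is the assumed probability of observing x in state k.
    """
    n_states = len(init)
    alphas = []
    prev = None
    for t, x in enumerate(obs):
        if t == 0:
            unnorm = [init[k] * emit[k][x] for k in range(n_states)]
        else:
            unnorm = [
                emit[k][x] * sum(prev[j] * trans[j][k] for j in range(n_states))
                for k in range(n_states)
            ]
        total = sum(unnorm)
        prev = [u / total for u in unnorm]
        alphas.append(prev)
    return alphas


def backward_sample(alphas, trans, rng):
    """Draw one hidden-state path from its joint conditional distribution.

    The last state is sampled from the final filtered distribution; each
    earlier state is sampled with weight alphas[t][j] * trans[j][next_state].
    """
    n_states = len(trans)
    n_steps = len(alphas)
    states = [0] * n_steps
    states[-1] = rng.choices(range(n_states), weights=alphas[-1])[0]
    for t in range(n_steps - 2, -1, -1):
        nxt = states[t + 1]
        weights = [alphas[t][j] * trans[j][nxt] for j in range(n_states)]
        states[t] = rng.choices(range(n_states), weights=weights)[0]
    return states


# Illustrative inputs (all values are assumptions, not taken from the paper).
obs = [1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1]  # 1 = conserved column
init = [0.5, 0.5]                     # state 0 = low, state 1 = high conservation
trans = [[0.9, 0.1], [0.1, 0.9]]      # trans[j][k] = P(next = k | current = j)
emit = [[0.7, 0.3], [0.1, 0.9]]       # emit[k][x] = P(obs = x | state = k)

rng = random.Random(0)
alphas = forward_filter(obs, init, trans, emit)
states = backward_sample(alphas, trans, rng)
print(states)
```

In a full Gibbs sampler, a draw like this would alternate with updates of the transition and emission parameters given the sampled path. Sampling the path jointly through the recursion, rather than one position at a time, is a common choice because neighboring hidden states are strongly correlated.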