Peng Pei-Chen, Hassan Samee Md Abul, Sinha Saurabh
Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, Illinois.
Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, Illinois; Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois.
Biophys J. 2015 Mar 10;108(5):1257-67. doi: 10.1016/j.bpj.2014.12.037.
Predicting gene expression levels from regulatory sequences is one of the major challenges of genomic biology today. A particularly promising approach is that of thermodynamics-based models, which interpret an enhancer sequence in a given cellular context, specified by transcription factor concentration levels, and predict the precise expression level driven by that enhancer. Such models have so far not accounted for the effect of chromatin accessibility on interactions between transcription factors and DNA, and consequently on gene expression levels. Here, we extend a thermodynamics-based model of gene expression, called GEMSTAT (Gene Expression Modeling Based on Statistical Thermodynamics), to incorporate chromatin accessibility data and quantify its effect on the accuracy of expression prediction. In the new model, called GEMSTAT-A, accessibility at a binding site is assumed to affect the transcription factor's binding strength at that site, whereas all other aspects are identical to the GEMSTAT model. We show that this modification results in significantly better fits on a data set of over 30 enhancers regulating spatial expression patterns in the blastoderm-stage Drosophila embryo. Importantly, the improved fits result not from an overall elevated accessibility in active enhancers but from the variation of accessibility levels within an enhancer. With whole-genome DNA accessibility measurements becoming increasingly common, our work demonstrates how such data may be useful for sequence-to-expression models. It also calls for future advances in modeling accessibility levels from sequence and the trans-regulatory context, so as to accurately predict the effect of cis and trans perturbations on gene expression.
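The central idea, that accessibility at a binding site modulates the transcription factor's binding strength there, can be illustrated with a minimal single-site sketch. The abstract does not state the functional form, so everything below is an assumption made for illustration: the exponential penalty `exp(-k_acc * (1 - accessibility))`, the parameter name `k_acc`, the scaling of accessibility to [0, 1], and the function names `site_weight` and `occupancy` are all hypothetical and are not taken from the GEMSTAT-A implementation.

```python
import math


def site_weight(conc, rel_affinity, accessibility, k_acc=1.0):
    """Statistical weight of the TF-bound state at one binding site.

    conc          -- transcription factor concentration (arbitrary units)
    rel_affinity  -- binding strength of the site relative to the optimal site
    accessibility -- local chromatin accessibility, assumed scaled to [0, 1]
    k_acc         -- hypothetical free parameter controlling how strongly
                     low accessibility penalizes binding (0 disables it)

    The exponential penalty is an ASSUMED form chosen for illustration.
    """
    if not 0.0 <= accessibility <= 1.0:
        raise ValueError("accessibility must lie in [0, 1]")
    penalty = math.exp(-k_acc * (1.0 - accessibility))
    return conc * rel_affinity * penalty


def occupancy(weight):
    """Probability that an isolated site is bound: q / (1 + q)."""
    return weight / (1.0 + weight)


if __name__ == "__main__":
    # Same site and TF concentration, three accessibility levels.
    for acc in (0.2, 0.6, 1.0):
        q = site_weight(conc=2.0, rel_affinity=0.5, accessibility=acc)
        print(f"accessibility={acc:.1f}  occupancy={occupancy(q):.3f}")
```

Under these assumptions, a fully accessible site (or `k_acc = 0`) reduces to the accessibility-free weight, which mirrors the abstract's statement that all other aspects of the model are unchanged. Because the penalty is applied per site, accessibility that varies within an enhancer changes the relative contributions of individual sites, not just the enhancer's overall output.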