Instytut Genetyki i Hodowli Zwierząt Polskiej Akademii Nauk, Jastrzębiec, Magdalenka, Poland.
PLoS One. 2018 Jun 21;13(6):e0198961. doi: 10.1371/journal.pone.0198961. eCollection 2018.
Understanding how regulatory elements control mammalian gene expression is a challenge of the post-genomic era. We previously reported that the size of the proximal promoter architecture predicted the breadth of expression (the fraction of tissues in which a gene is expressed). Herein, the contributions of individual transcription factors (TFs) were quantified. Several statistical modelling techniques were used and compared: tree models, generalized linear models (GLMs, with and without regularization), Bayesian GLMs and random forests. Both linear and non-linear modelling strategies were explored. Encouragingly, the different models led to similar statistical conclusions and biological interpretations. The majority of ENCODE TFs correlated positively with housekeeping expression, whereas a minority correlated negatively. Thus, housekeeping expression can be understood as a cumulative effect of many types of TF binding sites. This is accompanied by the exclusion of a smaller number of binding-site types for TFs that are repressors or that support cell lineage commitment, temporarily inducible expression or spatially restricted expression.
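As a rough illustration of the quantities named in the abstract, the following minimal Python sketch computes breadth of expression as the fraction of tissues in which a gene is expressed, and then checks the sign of the correlation between the presence of a TF binding site and that breadth. The toy expression values, the expression threshold of 1.0, the gene and TF names, and the helper functions `breadth` and `pearson` are all hypothetical. It is not the authors' pipeline, which relied on tree models, GLMs, Bayesian GLMs and random forests.

```python
from math import sqrt

# Hypothetical toy data: expression level of each gene in four tissues.
EXPRESSION = {
    "geneA": [5.0, 4.2, 6.1, 3.3],
    "geneB": [0.0, 2.5, 0.1, 3.0],
    "geneC": [0.0, 0.0, 7.5, 0.2],
    "geneD": [1.5, 2.0, 0.3, 4.0],
}

# Hypothetical binary indicators: 1 if the TF has a binding site
# in the gene's proximal promoter, 0 otherwise.
TF_SITES = {
    "TF_pos": {"geneA": 1, "geneB": 0, "geneC": 0, "geneD": 1},
    "TF_neg": {"geneA": 0, "geneB": 1, "geneC": 1, "geneD": 0},
}

# Assumed cut-off above which a gene counts as "expressed" in a tissue.
THRESHOLD = 1.0


def breadth(levels, threshold=THRESHOLD):
    """Fraction of tissues in which the expression level reaches the threshold."""
    return sum(1 for x in levels if x >= threshold) / len(levels)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


genes = sorted(EXPRESSION)
breadths = {g: breadth(EXPRESSION[g]) for g in genes}

# Correlate each TF's binding-site indicator with breadth of expression.
correlations = {
    tf: pearson([sites[g] for g in genes], [breadths[g] for g in genes])
    for tf, sites in TF_SITES.items()
}

for tf, r in correlations.items():
    direction = "positively" if r > 0 else "negatively"
    print(f"{tf} correlates {direction} with breadth (r = {r:.2f})")
```

In this toy setting a TF whose sites occur mostly in broadly expressed genes gets a positive coefficient, and one whose sites occur mostly in narrowly expressed genes gets a negative one. A single pairwise correlation ignores the joint effect of many TFs, which is why the study turned to multivariate models.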