Daigle Bernie J, Soltani Mohammad, Petzold Linda R, Singh Abhyudai
Institute for Collaborative Biotechnologies, University of California, Santa Barbara, CA 93106, Department of Electrical and Computer Engineering, University of Delaware, Newark, DE 19716 and Department of Computer Science, University of California, Santa Barbara, CA 93106, USA.
Bioinformatics. 2015 May 1;31(9):1428-35. doi: 10.1093/bioinformatics/btv007. Epub 2015 Jan 7.
Stochastic promoter switching between transcriptionally active (ON) and inactive (OFF) states is a major source of noise in gene expression. It is often implicitly assumed that transitions between promoter states are memoryless, i.e. promoters spend an exponentially distributed time interval in each of the two states. However, increasing evidence suggests that promoter ON/OFF times can be non-exponential, hinting at more complex transcriptional regulatory architectures. Given the essential role of gene expression in all cellular functions, efficient computational techniques for characterizing promoter architectures are critically needed.
We have developed a novel model reduction for promoters with arbitrary numbers of ON and OFF states, allowing us to approximate complex promoter switching behavior with Weibull-distributed ON/OFF times. Using this model reduction, we created bursty Monte Carlo expectation-maximization with modified cross-entropy method ('bursty MCEM(2)'), an efficient parameter estimation and model selection technique for inferring the number and configuration of promoter states from single-cell gene expression data. Application of bursty MCEM(2) to data from the endogenous mouse glutaminase promoter reveals nearly deterministic promoter OFF times, consistent with a multi-step activation mechanism consisting of 10 or more inactive states. Our novel approach to modeling promoter fluctuations together with bursty MCEM(2) provides powerful tools for characterizing transcriptional bursting across genes under different environmental conditions.
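The link between multi-step activation and nearly deterministic OFF times can be illustrated with a minimal sketch. It is not the authors' bursty MCEM(2) code or their model reduction. It assumes a hypothetical promoter that passes irreversibly through `n_states` inactive states with equal rates, so the total OFF time is Erlang distributed with coefficient of variation (CV) 1/sqrt(n). All function names and parameter values are illustrative. The `weibull_cv` helper shows that the CV of a Weibull distribution depends only on its shape parameter, which is one reason a Weibull can mimic both exponential (shape 1) and narrower dwell times (shape above 1).

```python
import math
import random
import statistics


def simulate_off_times(n_states, mean_off_time, n_samples, rng):
    """Draw OFF durations for a promoter that crosses n_states sequential
    inactive states, each with an exponential dwell time (assumed equal rates)."""
    step_rate = n_states / mean_off_time  # keeps the total mean OFF time fixed
    return [
        sum(rng.expovariate(step_rate) for _ in range(n_states))
        for _ in range(n_samples)
    ]


def coefficient_of_variation(samples):
    """Standard deviation divided by mean."""
    return statistics.pstdev(samples) / statistics.fmean(samples)


def weibull_cv(shape):
    """CV of a Weibull distribution; it depends only on the shape parameter."""
    g1 = math.gamma(1.0 + 1.0 / shape)
    g2 = math.gamma(1.0 + 2.0 / shape)
    return math.sqrt(g2 / g1**2 - 1.0)


if __name__ == "__main__":
    rng = random.Random(0)
    for n in (1, 2, 10, 50):
        cv = coefficient_of_variation(simulate_off_times(n, 1.0, 20000, rng))
        print(f"n={n:3d}  simulated CV={cv:.3f}  1/sqrt(n)={1 / math.sqrt(n):.3f}")
```

Under these assumptions a single OFF state gives a CV near 1 (memoryless switching), while 10 or more sequential states push the CV below about 0.32, so the OFF time becomes increasingly predictable.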
R source code implementing bursty MCEM(2) is available upon request.
Supplementary data are available at Bioinformatics online.