Université Paris-Saclay, INRAE, MaIAGE, Jouy-en-Josas, France.
J R Soc Interface. 2020 Oct;17(171):20200600. doi: 10.1098/rsif.2020.0600. Epub 2020 Oct 7.
Automatic de novo identification of the main regulons of a bacterium from genome and transcriptome data remains a challenge. To address this task, we propose a statistical model that can use information on exact positions of the transcription start sites and condition-dependent expression profiles. The central idea of this model is to improve the probabilistic representation of the promoter DNA sequences by incorporating covariates summarizing expression profiles (e.g. coordinates in projection spaces or hierarchical clustering trees). A dedicated trans-dimensional Markov chain Monte Carlo algorithm adjusts the width and palindromic properties of the corresponding position-weight matrices, the number of parameters to describe exact position relative to the transcription start site, and chooses the expression covariates relevant for each motif. All parameters are estimated simultaneously, for many motifs and many expression covariates. The method is applied to a dataset of transcription start sites and expression profiles available for . The results validate the approach and provide a new global view of the transcription regulatory network of this important pathogen. Remarkably, a previously unreported motif is found in promoter regions of ribosomal protein genes, suggesting a role in the regulation of growth.
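As a rough illustration of the ingredients named in the abstract, the sketch below scores promoter windows against a position-weight matrix, checks whether the matrix is palindromic (equal to its own reverse complement), and maps expression covariates to a motif occurrence probability. It is not the authors' model or their trans-dimensional MCMC sampler: the logistic link, the uniform background, the toy matrix and all function names are assumptions made for this example only.

```python
import math

BASES = "ACGT"
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def log_odds_score(pwm, background, site):
    """Log-likelihood ratio of `site` under the PWM versus the background."""
    return sum(math.log(col[b] / background[b]) for col, b in zip(pwm, site))


def is_palindromic(pwm, tol=1e-9):
    """True if the PWM equals its own reverse complement (within `tol`)."""
    w = len(pwm)
    return all(
        abs(pwm[i][b] - pwm[w - 1 - i][COMPLEMENT[b]]) <= tol
        for i in range(w)
        for b in BASES
    )


def occurrence_probability(covariates, weights, intercept):
    """Probability that a promoter carries the motif, given expression covariates.

    Assumption: a logistic link; the abstract does not state the link function.
    """
    z = intercept + sum(w * x for w, x in zip(weights, covariates))
    return 1.0 / (1.0 + math.exp(-z))


def best_site(pwm, background, promoter):
    """Return (score, start) of the highest-scoring window in `promoter`."""
    w = len(pwm)
    return max(
        (log_odds_score(pwm, background, promoter[i:i + w]), i)
        for i in range(len(promoter) - w + 1)
    )


# Hypothetical width-4 palindromic motif with consensus ACGT.
EXAMPLE_PWM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]
UNIFORM_BACKGROUND = {b: 0.25 for b in BASES}

if __name__ == "__main__":
    score, start = best_site(EXAMPLE_PWM, UNIFORM_BACKGROUND, "TTACGTAA")
    print("best window starts at", start, "with log-odds", round(score, 3))
    print("palindromic:", is_palindromic(EXAMPLE_PWM))
    print("P(motif | covariates):", occurrence_probability([1.5], [0.8], -1.0))
```

In the approach described above, the motif width, the palindromic constraint and the set of relevant covariates are themselves sampled rather than fixed; here they are fixed by hand to keep the example short.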