Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America.
Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America.
PLoS Comput Biol. 2018 Sep 26;14(9):e1006459. doi: 10.1371/journal.pcbi.1006459. eCollection 2018 Sep.
Studying a gene's regulatory mechanisms is a tedious process that involves identification of candidate regulators by transcription factor (TF) knockout or over-expression experiments, delineation of enhancers by reporter assays, and demonstration of direct TF influence by site mutagenesis, among other approaches. Such experiments are often chosen from among several testable hypotheses on the basis of the biologist's intuition. We pursue the goal of making this process systematic by using ideas from information theory to reason about experiments in gene regulation, in the hope of ultimately enabling rigorous experiment design strategies. For this, we make use of a state-of-the-art mathematical model of gene expression, which provides a way to formalize our current knowledge of the cis- as well as trans-regulatory mechanisms of a gene. Ambiguities in such knowledge can be expressed as uncertainties in the model, which we capture formally by building an ensemble of plausible models that fit the existing data and defining a probability distribution over the ensemble. We then characterize the impact of a new experiment on our understanding of the gene's regulation based on how the ensemble of plausible models and its probability distribution change when challenged with results from that experiment. This allows us to assess the 'value' of the experiment retroactively as the reduction in entropy of the distribution (information gain) resulting from the experiment's results. We fully formalize this novel approach to reasoning about gene regulation experiments and use it to evaluate a variety of perturbation experiments on two developmental genes of D. melanogaster. We also provide objective and 'biologist-friendly' descriptions of the information gained from each such experiment. The rigorously defined information theoretic approaches presented here can be used in the future to formulate systematic strategies for experiment design pertaining to studies of gene regulatory mechanisms.
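The central quantity in the abstract, information gain as the reduction in entropy of a probability distribution over an ensemble of plausible models, can be illustrated with a minimal sketch. The abstract does not specify how the distribution is defined or updated, so everything here is an assumption made for illustration: the ensemble is reduced to a list of four hypothetical models, the update is a plain Bayes reweighting, and the names `entropy`, `update`, and `information_gain` as well as the likelihood values are invented for this example rather than taken from the paper.

```python
import math


def entropy(probs):
    """Shannon entropy, in bits, of a discrete distribution over models."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def update(prior, likelihoods):
    """Reweight each model by how well it explains the new result (Bayes' rule).

    Assumption: likelihoods[i] is the probability of the observed experimental
    result under model i. The paper may define this step differently.
    """
    unnormalized = [p * l for p, l in zip(prior, likelihoods)]
    total = sum(unnormalized)
    if total == 0:
        raise ValueError("the result is impossible under every model in the ensemble")
    return [u / total for u in unnormalized]


def information_gain(prior, likelihoods):
    """Entropy of the distribution before the experiment minus entropy after it."""
    return entropy(prior) - entropy(update(prior, likelihoods))


# Hypothetical ensemble: four equally plausible models of a gene's regulation.
prior = [0.25, 0.25, 0.25, 0.25]

# Hypothetical experiment whose result is consistent with the first two models
# and rules out the other two.
likelihoods = [1.0, 1.0, 0.0, 0.0]

gain = information_gain(prior, likelihoods)
print(f"information gain: {gain} bits")
```

In this toy case the experiment halves the set of viable models, so the entropy falls from 2 bits to 1 bit. An experiment whose result is equally likely under every model leaves the distribution unchanged and gains nothing, and for a single observed result the difference computed this way can also be negative.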