Department of Bioengineering, University of California at San Diego, La Jolla, CA 92093.
Bioinformatics and Systems Biology Program, University of California at San Diego, La Jolla, CA 92093.
Proc Natl Acad Sci U S A. 2017 Sep 19;114(38):10286-10291. doi: 10.1073/pnas.1702581114. Epub 2017 Sep 5.
Transcriptional regulatory networks (TRNs) have been studied intensely for more than 25 years. Yet, even for the Escherichia coli TRN, probably the best characterized TRN, several questions remain. Here, we address three questions: (i) How complete is our knowledge of the E. coli TRN; (ii) how well can we predict gene expression using this TRN; and (iii) how robust is our understanding of the TRN? First, we reconstructed a high-confidence TRN (hiTRN) consisting of 147 transcription factors (TFs) regulating 1,538 transcription units (TUs) encoding 1,764 genes. The 3,797 high-confidence regulatory interactions were collected from published, validated chromatin immunoprecipitation (ChIP) data and RegulonDB. For 21 different TF knockouts, up to 63% of the differentially expressed genes in the hiTRN were traced to the knocked-out TF through regulatory cascades. Second, we trained supervised machine learning algorithms to predict the expression of 1,364 TUs given TF activities using 441 samples. The algorithms accurately predicted condition-specific expression for 86% (1,174 of 1,364) of the TUs, while 193 TUs (14%) were predicted better than random TRNs. Third, we identified 10 regulatory modules whose definitions were robust against changes to the TRN or expression compendium. Using surrogate variable analysis, we also identified three unmodeled factors that systematically influenced gene expression. Our computational workflow comprehensively characterizes the predictive capabilities and systems-level functions of an organism's TRN from disparate data types.
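The knockout analysis in the abstract, tracing differentially expressed genes back to the knocked-out TF through regulatory cascades, can be read as a reachability question on a directed graph. The following is only a minimal sketch under that assumption, not the authors' workflow: the function names, the dictionary representation of the network, and the toy data are all hypothetical and chosen purely for illustration.

```python
from collections import deque


def downstream_targets(trn, tf):
    """Return every node reachable from `tf` by following regulatory edges.

    `trn` is a hypothetical representation: a dict mapping each regulator
    to the list of TFs or genes it directly regulates.
    """
    seen = set()
    queue = deque([tf])
    while queue:
        regulator = queue.popleft()
        for target in trn.get(regulator, ()):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


def fraction_traced(trn, knocked_out_tf, de_genes):
    """Fraction of differentially expressed genes reachable from the knocked-out TF."""
    de_set = set(de_genes)
    if not de_set:
        return 0.0
    reachable = downstream_targets(trn, knocked_out_tf)
    return len(reachable & de_set) / len(de_set)


# Toy network and gene list, invented for illustration (not data from the paper).
toy_trn = {
    "tfA": ["tfB", "gene1"],   # tfA regulates another TF and one gene directly
    "tfB": ["gene2", "gene3"], # tfB extends the cascade to two more genes
    "tfC": ["gene4"],          # tfC is not downstream of tfA
}
toy_de_genes = ["gene1", "gene2", "gene4"]

print(fraction_traced(toy_trn, "tfA", toy_de_genes))
```

In this toy case, gene1 is a direct target of tfA, gene2 is reached through the tfA to tfB cascade, and gene4 is regulated only by tfC, so two of the three differentially expressed genes are traced to the knockout. A breadth-first search is used here simply because it visits each regulatory edge once; any graph traversal would give the same reachable set.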