Kim Minseung, Zorraquino Violeta, Tagkopoulos Ilias
Department of Computer Science, University of California, Davis, Davis, California, United States of America; UC Davis Genome Center, University of California, Davis, Davis, California, United States of America.
UC Davis Genome Center, University of California, Davis, Davis, California, United States of America.
PLoS Comput Biol. 2015 Mar 16;11(3):e1004127. doi: 10.1371/journal.pcbi.1004127. eCollection 2015 Mar.
A tantalizing question in cellular physiology is whether the cellular state and environmental conditions can be inferred from the expression signature of an organism. To investigate this relationship, we created an extensive normalized gene expression compendium for the bacterium Escherichia coli that was further enriched with meta-information through an iterative learning procedure. We then constructed an ensemble method to predict environmental and cellular state, including strain, growth phase, medium, oxygen level, and the presence of antibiotics and carbon sources. Results show that gene expression is an excellent predictor of environmental structure, with multi-class ensemble models achieving balanced accuracy between 70.0% (±3.5%) and 98.3% (±2.3%) for the various characteristics. Interestingly, this performance can be significantly boosted when environmental and strain characteristics are considered simultaneously: a composite classifier that captures the inter-dependencies of three characteristics (medium, phase and strain) achieved 10.6% (±1.0%) higher performance than any individual model. Contrary to expectations, only 59% of the top informative genes were also identified as differentially expressed under the respective conditions. Functional analysis of the respective genetic signatures implicates a wide spectrum of Gene Ontology terms and KEGG pathways with condition-specific information content, including iron transport, transferases, and enterobactin synthesis. Further experimental phenotypic-to-genotypic mapping that we conducted for knock-out mutants argues for the information content of top-ranked genes. This work demonstrates the degree to which genome-scale transcriptional information can be predictive of latent, heterogeneous and seemingly disparate phenotypic and environmental characteristics, with far-reaching applications.
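The abstract reports multi-class performance as balanced accuracy but does not spell out the formula. A minimal sketch, assuming the common definition (the unweighted mean of per-class recall), is given below. The function name `balanced_accuracy` and the growth-phase labels are hypothetical and serve only as illustration; they are not taken from the study.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of per-class recall over the classes present in y_true.

    Assumption: this is the common multi-class definition; the abstract
    does not state which variant the authors used.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("y_true must not be empty")

    total = defaultdict(int)    # number of samples per true class
    correct = defaultdict(int)  # correctly predicted samples per true class
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1

    recalls = [correct[label] / total[label] for label in total]
    return sum(recalls) / len(recalls)


# Hypothetical, imbalanced growth-phase labels (illustration only).
y_true = ["exponential"] * 8 + ["stationary"] * 2
y_pred = ["exponential"] * 10  # a classifier that always predicts the majority class

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(plain_accuracy)                     # 0.8
print(balanced_accuracy(y_true, y_pred))  # 0.5
```

In this toy case the majority-class predictor reaches 80% plain accuracy but only 50% balanced accuracy, which is why the balanced metric is the more informative choice when conditions are unevenly represented in an expression compendium.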