Ferrazzi Fulvia, Magni Paolo, Sacchi Lucia, Nuzzo Angelo, Petrovic Uros, Bellazzi Riccardo
Dipartimento di Informatica e Sistemistica, Università degli Studi di Pavia, via Ferrata 1, 27100 Pavia, Italy.
Int J Med Inform. 2007 Dec;76 Suppl 3:S462-75. doi: 10.1016/j.ijmedinf.2007.07.005. Epub 2007 Sep 6.
This paper proposes a methodology for learning gene regulatory networks from DNA microarray data by integrating different data and knowledge sources. We applied the method to Saccharomyces cerevisiae experiments, focusing on cell cycle regulatory mechanisms. We exploited data from deletion mutant experiments (static data), gene expression time series (dynamic data) and the knowledge encoded in the Gene Ontology.
The proposed method is based on four phases. An initial gene network was derived from static data by means of a simple statistical approach. Then, the genes classified in the Gene Ontology as being involved in the cell cycle were selected. As a third step, the network structure was used to initialize a linear dynamic model of gene expression profiles. Finally, a genetic algorithm was applied to update the gene network exploiting data coming from an experiment on the yeast cell cycle.
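As an illustration of the third phase, the following minimal sketch shows how a network structure could constrain a linear dynamic model. The abstract does not state the model's exact form, so the first-order discrete-time form x(t+1) = A x(t), the per-gene least-squares fit, and the names `simulate` and `fit_masked_linear_model` are assumptions made for this example, not the authors' implementation.

```python
import numpy as np


def simulate(A, x0, n_steps):
    """Generate a noise-free time series from x(t+1) = A x(t) (assumed model form)."""
    X = np.empty((n_steps, len(x0)))
    X[0] = x0
    for t in range(1, n_steps):
        X[t] = A @ X[t - 1]
    return X


def fit_masked_linear_model(X, mask):
    """Fit A in x(t+1) = A x(t), keeping only the connections allowed by `mask`.

    X    : (T, n) array, one row per time point, one column per gene.
    mask : (n, n) boolean array; mask[i, j] is True if gene j may regulate gene i.
    Returns the fitted (n, n) matrix and the residual sum of squares.
    """
    past, future = X[:-1], X[1:]
    n = X.shape[1]
    A = np.zeros((n, n))
    for i in range(n):
        parents = np.flatnonzero(mask[i])
        if parents.size == 0:
            continue  # gene i has no regulators in this structure
        # Ordinary least squares restricted to the regulators of gene i.
        coef, *_ = np.linalg.lstsq(past[:, parents], future[:, i], rcond=None)
        A[i, parents] = coef
    rss = float(np.sum((future - past @ A.T) ** 2))
    return A, rss


if __name__ == "__main__":
    # Hypothetical three-gene network, used only to exercise the functions.
    A_true = np.array([[0.9, 0.0, 0.0],
                       [0.5, 0.5, 0.0],
                       [0.0, -0.3, 0.8]])
    X = simulate(A_true, np.array([1.0, 2.0, 3.0]), n_steps=10)
    A_hat, rss = fit_masked_linear_model(X, A_true != 0)
    print(A_hat)
    print(rss)
```

In a structure search such as the genetic algorithm of the fourth phase, each candidate `mask` could be scored by refitting the model in this way; connections fixed by prior knowledge would simply stay True in every candidate.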
We compared the network models produced by our approach with those obtained by a fully data-driven approach, using their Akaike Information Criterion (AIC) scores and the percentage of connections preserved in the best solutions. The results show that several solutions that are nearly equivalent in terms of AIC score can be found. This problem is greatly mitigated by our approach, which finds more robust models by fixing a portion of the network structure on the basis of prior knowledge. The best network structure was biologically evaluated on a set of 22 known cell cycle genes against independent knowledge sources.
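The two comparison criteria can be sketched as follows. The abstract gives neither formula, so both are assumptions: the AIC is written in its common least-squares form under Gaussian residuals (up to an additive constant), and "preserved connections" is read as the share of a reference network's edges that also appear in a second network. The function names are hypothetical.

```python
import numpy as np


def aic_least_squares(rss, n_obs, n_params):
    """AIC for a least-squares fit with Gaussian residuals, up to an additive constant."""
    return n_obs * np.log(rss / n_obs) + 2 * n_params


def preserved_fraction(reference, candidate):
    """Fraction of the edges in `reference` that are also present in `candidate`.

    Both arguments are adjacency matrices; nonzero entries denote connections.
    Returns NaN when the reference network has no edges.
    """
    reference = np.asarray(reference, dtype=bool)
    candidate = np.asarray(candidate, dtype=bool)
    n_reference = reference.sum()
    if n_reference == 0:
        return float("nan")
    return float((reference & candidate).sum() / n_reference)
```

Under this form, a lower AIC indicates a better trade-off between fit and number of parameters, so two structures with close scores are hard to tell apart on the data alone.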
An approach able to integrate several sources of information is needed to infer gene regulatory networks, as a fully data-driven search is in general prone to overfitting and to unidentifiability problems. The learned networks encode hypotheses on regulatory relationships that need to be verified by means of wet-lab experiments.