Center for Intelligent Data Analysis, School of Information Systems, Computing and Mathematics, Brunel University, Uxbridge, Middlesex, UB8 3PH, UK.
BMC Bioinformatics. 2010 Jan 15;11:32. doi: 10.1186/1471-2105-11-32.
In microarray data analysis, factors such as data quality, biological variation, and the increasingly multi-layered nature of complex biological systems complicate the modelling of regulatory networks that represent and capture the interactions among genes. We believe that using multiple datasets derived from related biological systems leads to more robust models. We therefore developed a novel framework for modelling regulatory networks that involves training and evaluation on independent datasets. Our approach includes the following steps: (1) ordering the datasets by their level of noise and informativeness; (2) selecting a Bayesian classifier with an appropriate level of complexity by evaluating predictive performance on independent datasets; (3) comparing the different gene selections and the influence of increasing model complexity; (4) functional analysis of the informative genes.
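Step (2) can be illustrated with a minimal sketch. It assumes discretised expression states, a naive Bayes classifier, and "complexity" measured as the number of predictor genes. These are illustrative assumptions rather than the classifiers or settings used in the paper, and the function and gene names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(rows, target, features, alpha=1.0):
    """Fit a discrete naive Bayes model predicting `target` from `features`.

    rows: list of dicts mapping gene name -> discrete state (e.g. 0 = low, 1 = high).
    alpha: Laplace smoothing constant (an assumption of this sketch).
    """
    class_counts = Counter(r[target] for r in rows)
    cond_counts = defaultdict(Counter)  # (feature, class) -> counts of feature states
    states = defaultdict(set)           # feature -> states seen in training
    for r in rows:
        for f in features:
            cond_counts[(f, r[target])][r[f]] += 1
            states[f].add(r[f])
    return {"target": target, "features": list(features), "alpha": alpha,
            "class_counts": class_counts, "cond_counts": cond_counts,
            "states": states, "n": len(rows)}


def predict(model, row):
    """Return the most probable state of the target gene for one sample."""
    best, best_lp = None, -math.inf
    for c, n_c in model["class_counts"].items():
        lp = math.log(n_c / model["n"])
        for f in model["features"]:
            k = len(model["states"][f] | {row[f]})  # number of possible states
            num = model["cond_counts"][(f, c)][row[f]] + model["alpha"]
            lp += math.log(num / (n_c + model["alpha"] * k))
        if lp > best_lp:
            best, best_lp = c, lp
    return best


def accuracy(model, rows):
    """Fraction of samples whose target state is predicted correctly."""
    hits = sum(predict(model, r) == r[model["target"]] for r in rows)
    return hits / len(rows)


def select_complexity(train_rows, test_rows, target, ranked_features, max_k):
    """Train on one dataset and score on an independent one for k = 1..max_k
    predictor genes; return the smallest k with the best independent accuracy."""
    scores = {}
    best_k, best_score = None, -1.0
    for k in range(1, max_k + 1):
        model = train_naive_bayes(train_rows, target, ranked_features[:k])
        scores[k] = accuracy(model, test_rows)
        if scores[k] > best_score:  # strict '>' keeps the simpler model on ties
            best_k, best_score = k, scores[k]
    return best_k, scores
```

Here `train_rows` would come from the simpler dataset and `test_rows` from an independent, more complex one. Keeping the smaller model on ties is a design choice of this sketch that favours lower complexity.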
In this paper, we identify the most appropriate model complexity, using cross-validation and independent test set validation, for predicting gene expression in three published datasets related to myogenesis and muscle differentiation. Furthermore, we demonstrate that models trained on simpler datasets can be used to identify interactions among genes and to select the most informative genes. We also show that these models explain the myogenesis-related genes (the genes of interest) significantly better than others (P < 0.004), as the improvement in their rankings is much more pronounced. Finally, after further evaluating our results on synthetic datasets, we show that our approach outperforms a concordance method by Lai et al. in identifying informative genes from multiple datasets of increasing complexity, whilst additionally modelling the interactions between genes.
We show that Bayesian networks derived from simpler controlled systems perform better than those trained on datasets from more complex biological systems. Further, we show that genes from the pool of differentially expressed genes that are highly predictive and consistent across independent datasets are more likely to be fundamentally involved in the biological process under study. We conclude that networks trained on simpler controlled systems, such as in vitro experiments, can be used to model and capture interactions among genes in more complex datasets, such as in vivo experiments, where these interactions would otherwise be concealed by a multitude of other ongoing events.
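The idea of selecting genes that are both highly predictive and consistent across independent datasets can be sketched as follows. The per-gene accuracy input, the `min_accuracy` threshold, and the ranking by worst-case accuracy are assumptions made for illustration; the abstract does not specify the selection criterion in this form.

```python
def consistent_predictive_genes(accuracy_by_dataset, min_accuracy=0.7):
    """Return genes that are predicted well in every dataset.

    accuracy_by_dataset: dict mapping dataset name -> dict of gene -> accuracy in [0, 1].
    min_accuracy: hypothetical cut-off that a gene must reach in all datasets.
    Genes are ranked by their worst accuracy across datasets, best first.
    """
    datasets = list(accuracy_by_dataset.values())
    if not datasets:
        return []
    # Only genes measured in every dataset can be judged for consistency.
    shared = set(datasets[0]).intersection(*datasets[1:])
    worst = {g: min(d[g] for d in datasets) for g in shared}
    kept = [g for g in shared if worst[g] >= min_accuracy]
    # Sort by worst-case accuracy (descending), then by name for a stable order.
    return sorted(kept, key=lambda g: (-worst[g], g))
```

Ranking by the worst-case accuracy rewards consistency: a gene that is predicted well in the in vitro data but poorly in the in vivo data falls below one that is predicted moderately well in both.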