De Souza Jacomini Ricardo, Martins David Correa, Da Silva Felipe Leno, Costa Anna Helena Reali
1 Escola Politécnica da Universidade de São Paulo, São Paulo, Brazil.
2 Universidade Federal do ABC, Santo André, Brazil.
J Comput Biol. 2017 Aug;24(8):809-830. doi: 10.1089/cmb.2017.0022. Epub 2017 Jun 21.
Gene network (GN) inference from temporal gene expression data is a crucial and challenging problem in systems biology. Expression data sets usually consist of dozens of temporal samples, while networks consist of thousands of genes, thus rendering many inference methods unfeasible in practice. To improve the scalability of GN inference methods, we propose a novel framework called GeNICE, based on probabilistic GNs; the main novelty is the introduction of a clustering procedure to group genes with related expression profiles and to provide an approximate solution with reduced computational complexity. We use the defined clusters to perform an exhaustive search to retrieve the best predictor gene subsets for each target gene, according to multivariate criterion functions. GeNICE greatly reduces the search space because predictor candidates are restricted to one gene per cluster. Finally, a multivariate analysis is performed for each defined predictor subset to retrieve minimal subsets and to simplify the network. In our experiments with in silico generated data sets, GeNICE achieved substantial computational time reduction when compared to solutions without the clustering step, while preserving the gene expression prediction accuracy even when the number of clusters is small (about 50) relative to the number of genes (order of thousands). For a Plasmodium falciparum microarray data set, the prediction accuracy achieved by GeNICE was roughly 97%, while the respective topologies involving glycolytic and apicoplast seed genes had a very large intramodularity, very small interconnection between modules, and some module hub genes, reflecting small-world and scale-free topological properties, as expected.
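The search-space reduction described above can be illustrated with a minimal sketch. It is not the GeNICE implementation. The abstract does not name the clustering algorithm, the criterion function, or how the single gene per cluster is chosen, so the code below makes three labeled assumptions: cluster memberships are supplied by the caller, the first member of each cluster serves as its representative, and the criterion is the conditional entropy of the target at time t+1 given the predictors at time t, computed on binarized expression. All function names are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import comb, inf, log2


def search_space(n_candidates, max_size):
    """Number of predictor subsets of size 1..max_size among n_candidates."""
    return sum(comb(n_candidates, k) for k in range(1, max_size + 1))


def conditional_entropy(predictor_series, target_series):
    """H(target[t+1] | predictors[t]) for discretized time series.

    Assumed criterion function: lower is better, 0.0 means the predictors
    fully determine the next target value in the observed samples.
    """
    joint, marginal = Counter(), Counter()
    n = len(target_series) - 1
    for t in range(n):
        state = tuple(series[t] for series in predictor_series)
        joint[(state, target_series[t + 1])] += 1
        marginal[state] += 1
    h = 0.0
    for (state, _), count in joint.items():
        h -= (count / n) * log2(count / marginal[state])
    return h


def pick_representatives(clusters):
    """One candidate gene per cluster (assumption: the first member)."""
    return [members[0] for members in clusters.values()]


def best_predictors(expr, target, candidates, max_size=2):
    """Exhaustive search over candidate subsets, smallest subsets first.

    A larger subset replaces the current best only if it is strictly
    better, so ties favor the smaller subset.
    """
    best_subset, best_score = None, inf
    for size in range(1, max_size + 1):
        for subset in combinations(candidates, size):
            score = conditional_entropy([expr[g] for g in subset], expr[target])
            if score < best_score:
                best_subset, best_score = subset, score
    return best_subset, best_score
```

With these assumptions, restricting the candidates to 50 cluster representatives instead of 1,000 genes reduces the number of subsets of size up to three from `search_space(1000, 3)` to `search_space(50, 3)`, which is where the computational saving comes from. The price is that the result is approximate: a true predictor that is not its cluster's representative can only be recovered through a gene with a related profile.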