Klie Sebastian, Nikoloski Zoran, Selbig Joachim
1 Max-Planck Institute for Molecular Plant Physiology, Potsdam, Brandenburg, Germany.
J Comput Biol. 2014 Jun;21(6):428-45. doi: 10.1089/cmb.2009.0129. Epub 2010 Jan 8.
Recent advances in high-throughput omics techniques make it possible to decode the function of genes by applying the "guilt-by-association" principle to biologically meaningful clusters of gene expression data. However, the existing frameworks for biological evaluation of gene clusters are hindered by two bottlenecks: (1) the choice of the number of clusters, and (2) external measures that do not take into consideration the structure of the analyzed data and the ontology of the existing biological knowledge. Here, we address these bottlenecks by developing a novel framework that allows not only for biological evaluation of gene expression clusters based on existing structured knowledge, but also for prediction of putative gene functions. The proposed framework facilitates propagation of statistical significance at each of the following steps: (1) estimating the number of clusters, (2) evaluating the clusters in terms of novel external structural measures, (3) selecting an optimal clustering algorithm, and (4) predicting gene functions. The framework also includes a method for evaluation of gene clusters based on the structure of the employed ontology. Moreover, our method for obtaining a probabilistic range for the number of clusters is shown to be valid on synthetic data and on available gene expression profiles from Saccharomyces cerevisiae. Finally, we propose a network-based approach for gene function prediction that relies on the clustering with the optimal score and on the employed ontology. Our approach effectively predicts gene function on the Saccharomyces cerevisiae data set and is also employed to obtain putative gene functions for an Arabidopsis thaliana data set.
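As a rough illustration of the "guilt-by-association" principle mentioned in the abstract, the following minimal Python sketch transfers annotation terms to unannotated genes when a sufficient fraction of the annotated genes in the same cluster share them. It is not the framework described here: it omits the statistical significance propagation, the ontology structure, and the network-based prediction step. The function name `predict_functions`, the `min_support` threshold, and the gene and term identifiers are all hypothetical and chosen only for this example.

```python
from collections import Counter


def predict_functions(clusters, annotations, min_support=0.5):
    """Naive guilt-by-association (illustrative assumption, not the paper's method).

    clusters:    dict mapping cluster id -> list of gene ids
    annotations: dict mapping gene id -> set of annotation terms
                 (genes missing from the dict count as unannotated)
    min_support: hypothetical threshold; a term is transferred only if at
                 least this fraction of the annotated genes in the cluster
                 carry it
    Returns a dict mapping each unannotated gene to a sorted list of
    predicted terms.
    """
    predictions = {}
    for genes in clusters.values():
        annotated = [g for g in genes if annotations.get(g)]
        if not annotated:
            continue  # nothing to transfer in this cluster
        # Count how many annotated genes carry each term.
        counts = Counter(t for g in annotated for t in annotations[g])
        shared = sorted(
            t for t, c in counts.items() if c / len(annotated) >= min_support
        )
        if not shared:
            continue
        for g in genes:
            if not annotations.get(g):
                predictions[g] = shared
    return predictions


# Toy data: gene and term names are placeholders, not real identifiers.
clusters = {0: ["g1", "g2", "g3", "g4"], 1: ["g5", "g6"]}
annotations = {
    "g1": {"term_A"},
    "g2": {"term_A", "term_B"},
    "g3": {"term_A"},
    "g5": {"term_C"},
}
predicted = predict_functions(clusters, annotations)
print(predicted)
```

In this toy input, `g4` and `g6` are the unannotated genes, and each receives only the terms that dominate its own cluster. A real evaluation would additionally need a significance test for each transferred term, which this sketch deliberately leaves out.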