Yan Xifeng, Mehan Michael R, Huang Yu, Waterman Michael S, Yu Philip S, Zhou Xianghong Jasmine
IBM T. J. Watson Research Center, Hawthorne, NY, USA.
Bioinformatics. 2007 Jul 1;23(13):i577-86. doi: 10.1093/bioinformatics/btm227.
A major challenge in studying gene regulation is to systematically reconstruct transcription regulatory modules, defined as sets of genes that are regulated by a common set of transcription factors. A commonly used approach to transcription module reconstruction is to derive coexpression clusters from a microarray dataset. However, such results often contain false positives, because genes from many transcription modules may be perturbed simultaneously under a given type of condition. In this study, we propose and validate the hypothesis that genes forming a coexpression cluster in multiple microarray datasets across diverse conditions are more likely to form a transcription module. However, identifying genes that are coexpressed in a subset of many microarray datasets is not a trivial computational problem.
We propose a graph-based data-mining approach to efficiently and systematically identify frequent coexpression clusters. Given m microarray datasets, we model each dataset as a coexpression graph and search for vertex sets that are densely connected in [θm] of the datasets (0 ≤ θ ≤ 1). For this novel graph-mining problem, we designed two techniques to narrow the search space: (1) partitioning the input graphs into (overlapping) groups that share common properties; (2) summarizing the vertex neighbor information from the partitioned datasets in 'Neighbor Association Summary Graphs' for effective mining. We applied our method to 105 human microarray datasets and identified a large number of potential transcription modules, activated under different subsets of conditions. Validation with ChIP-chip data demonstrated that the likelihood of a coexpression cluster being a transcription module increases significantly with its recurrence. Our method opens a new way to exploit the vast amount of accumulated microarray data for the study of gene regulation. Furthermore, the algorithm is applicable to other biological networks for approximate network module mining.
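The notion of a vertex set being "frequently densely connected" can be illustrated with a small sketch. This is not the authors' mining algorithm (it involves no partitioning and no summary graphs); it only checks one candidate gene set against the definition. Several details are assumptions made for illustration: each coexpression graph is stored as a set of unordered gene pairs, "densely connected" is taken to mean that the induced subgraph's edge density reaches a threshold `min_density`, the bracket in [θm] is read as a ceiling, and all function names and the toy graphs are hypothetical.

```python
from itertools import combinations
from math import ceil


def make_graph(pairs):
    """Store an undirected coexpression graph as a set of unordered gene pairs."""
    return {frozenset(pair) for pair in pairs}


def density(edges, vertices):
    """Edge density of the subgraph induced by `vertices` (0.0 if fewer than 2)."""
    vs = sorted(vertices)
    if len(vs) < 2:
        return 0.0
    possible = len(vs) * (len(vs) - 1) // 2
    present = sum(1 for u, v in combinations(vs, 2) if frozenset((u, v)) in edges)
    return present / possible


def is_frequent_dense(graphs, vertices, theta, min_density):
    """Return (is_frequent, support) for one candidate vertex set.

    Support is the number of graphs in which the set is dense. The set counts
    as frequent if support >= ceil(theta * m), where m = len(graphs).
    Reading [theta m] as a ceiling is an assumption of this sketch.
    """
    required = ceil(theta * len(graphs))
    support = sum(1 for g in graphs if density(g, vertices) >= min_density)
    return support >= required, support


# Toy input: four hypothetical coexpression graphs over genes A-D.
graphs = [
    make_graph([("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]),
    make_graph([("A", "B"), ("B", "C")]),
    make_graph([("A", "D")]),
    make_graph([("A", "B"), ("A", "C"), ("B", "C")]),
]

frequent, support = is_frequent_dense(graphs, {"A", "B", "C"}, theta=0.5, min_density=0.9)
print(frequent, support)
```

Checking a single candidate set is cheap, but the number of candidate sets grows exponentially with the number of genes, which is why the search-space reduction techniques described in the abstract are needed.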