Ihmels Jan, Bergmann Sven, Barkai Naama
Department of Molecular Genetics, Weizmann Institute of Science, Rehovot, Israel.
Bioinformatics. 2004 Sep 1;20(13):1993-2003. doi: 10.1093/bioinformatics/bth166. Epub 2004 Mar 25.
Large-scale gene expression data comprising a variety of cellular conditions hold the promise of a global view of the transcription program. While conventional clustering algorithms have been successfully applied to smaller datasets, the utility of many algorithms for the analysis of large-scale data is limited by their inability to capture combinatorial and condition-specific co-regulation. In addition, there is an increasing need to integrate the rapidly accumulating body of other high-throughput biological data with the expression analysis. In previous work, we introduced the signature algorithm, which overcomes the problems of conventional clustering and allows for intuitive integration of additional biological data. However, this approach is constrained by the comprehensiveness of the relevant external data and by its inability to capture hierarchical modularity.
We present a novel method for the analysis of large-scale expression data, which assigns genes to context-dependent and potentially overlapping regulatory units. We introduce the notion of a transcription module as a self-consistent regulatory unit consisting of a set of co-regulated genes as well as the experimental conditions that induce their co-regulation. Self-consistency is defined by a rigorous mathematical criterion. We propose an efficient algorithm to identify such modules, which is based on the iterative application of the signature algorithm. We also introduce a threshold parameter that determines the resolution of the modular decomposition.
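The iterative search for a self-consistent module can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state the normalization, the scoring rule, or the exact self-consistency criterion, so the z-score normalization, the one-sided "mean plus t standard deviations" cutoff, the stopping rule, and the names `zscore`, `threshold`, `find_module`, `t_gene`, and `t_cond` are all assumptions made for illustration. The sketch alternates between scoring conditions from the current gene set and scoring genes from the selected conditions, and stops when the selected gene set no longer changes.

```python
import numpy as np


def zscore(a, axis):
    """Standardize along `axis` (assumed normalization, not from the source)."""
    mu = a.mean(axis=axis, keepdims=True)
    sd = a.std(axis=axis, keepdims=True)
    sd[sd == 0] = 1.0  # avoid division by zero for constant rows/columns
    return (a - mu) / sd


def threshold(scores, t):
    """Keep scores more than t standard deviations above the mean; zero the rest."""
    sd = scores.std()
    if sd == 0:
        return np.zeros_like(scores)
    keep = scores > scores.mean() + t * sd
    return np.where(keep, scores, 0.0)


def find_module(expr, seed_genes, t_gene=2.0, t_cond=1.0, max_iter=100):
    """Iterate from a seed gene set to a fixed point (hypothetical helper).

    expr: 2-D array, rows are genes, columns are conditions.
    Returns the indices of the selected genes and conditions.
    """
    e_g = zscore(expr, axis=0)  # each condition standardized across genes
    e_c = zscore(expr, axis=1)  # each gene standardized across conditions

    genes = np.zeros(expr.shape[0])
    genes[seed_genes] = 1.0
    conds = np.zeros(expr.shape[1])

    for _ in range(max_iter):
        # Score conditions by the weighted expression of the current genes.
        conds = threshold(e_g.T @ genes, t_cond)
        # Score genes by their weighted expression over the selected conditions.
        new_genes = threshold(e_c @ conds, t_gene)
        # Assumed stopping rule: the selected gene set reproduces itself.
        converged = np.array_equal(new_genes > 0, genes > 0)
        genes = new_genes
        if converged:
            break

    return np.flatnonzero(genes), np.flatnonzero(conds)
```

Calling `find_module(expr, seed_genes=[0, 1, 2, 3, 4])` on a genes-by-conditions matrix returns two index arrays. In this sketch, raising `t_gene` yields smaller, tighter gene sets and lowering it yields larger ones, which is one way to read the threshold as a resolution parameter. Running the search from many different seeds and keeping the distinct fixed points would allow a gene to appear in more than one module.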
The method is applied systematically to over 1000 expression profiles of the yeast Saccharomyces cerevisiae, and the results are presented using two complementary visualization schemes we developed. The average biological coherence, as measured by the conservation of putative cis-regulatory motifs between four related yeast species, is higher for transcription modules than for clusters identified by other methods applied to the same dataset. Our method is related to singular value decomposition (SVD) and to the pairwise average linkage clustering algorithm. It extends SVD by filtering out noise in the expression data and offering variable resolution to reveal hierarchical organization. It furthermore has the advantage over both methods of capturing overlapping modules in the presence of combinatorial regulation.