Wilczyński Bartek, Hvidsten Torgeir R, Kryshtafovych Andriy, Tiuryn Jerzy, Komorowski Jan, Fidelis Krzysztof
Institute of Mathematics, Polish Academy of Sciences, Warsaw, Poland.
BMC Bioinformatics. 2006 Nov 17;7:505. doi: 10.1186/1471-2105-7-505.
We present an approach designed to identify gene regulation patterns using sequence and expression data collected for Saccharomyces cerevisiae. Our main goal is to relate the combinations of transcription factor binding sites (also referred to as binding site modules) identified in gene promoters to the expression of these genes. The novel aspects include local expression similarity clustering and an exact IF-THEN rule inference algorithm. We also provide a method of rule generalization to include genes with unknown expression profiles.
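To make the rule form concrete, the following is a minimal sketch of how a rule of the kind "IF a binding site module occurs in the promoter THEN the gene belongs to an expression cluster" could be evaluated and then generalized to genes without expression profiles. It is not the inference algorithm of the paper: the gene names, binding site names, cluster labels, and the functions `genes_matching` and `rule_support` are all hypothetical and chosen only for illustration.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

# Hypothetical toy data: binding sites found in each gene's promoter.
PROMOTER_SITES: Dict[str, Set[str]] = {
    "geneA": {"TF1", "TF2"},
    "geneB": {"TF1", "TF2", "TF3"},
    "geneC": {"TF2"},
    "geneD": {"TF1", "TF2"},  # no expression profile available
}

# Hypothetical expression-cluster labels; geneD is missing on purpose.
EXPRESSION_CLUSTER: Dict[str, str] = {
    "geneA": "cluster1",
    "geneB": "cluster1",
    "geneC": "cluster2",
}


def genes_matching(module: FrozenSet[str], sites: Dict[str, Set[str]]) -> List[str]:
    """Return genes whose promoters contain every binding site in the module."""
    return sorted(gene for gene, found in sites.items() if module <= found)


def rule_support(
    module: FrozenSet[str],
    cluster: str,
    sites: Dict[str, Set[str]],
    clusters: Dict[str, str],
) -> Tuple[List[str], float]:
    """Evaluate 'IF module THEN cluster' on genes with known expression."""
    matched = [g for g in genes_matching(module, sites) if g in clusters]
    hits = [g for g in matched if clusters[g] == cluster]
    accuracy = len(hits) / len(matched) if matched else 0.0
    return hits, accuracy


module = frozenset({"TF1", "TF2"})
hits, accuracy = rule_support(module, "cluster1", PROMOTER_SITES, EXPRESSION_CLUSTER)

# Generalization step (assumed form): genes that match the module
# but have no expression profile become predicted members of the rule.
generalized = [
    g for g in genes_matching(module, PROMOTER_SITES) if g not in EXPRESSION_CLUSTER
]
```

In this toy setting the rule is supported by the profiled genes that carry the module, and `generalized` lists the unprofiled genes the rule would additionally cover.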
We have implemented the proposed framework and tested it on publicly available datasets for the yeast S. cerevisiae. The testing procedure consists of thorough statistical analyses that compare the groups of genes matching the rules we infer from expression data against known sets of co-regulated genes. For this purpose we have used published ChIP-Chip data and Gene Ontology annotations. To make these tests more objective, we compare our results with recently published similar studies.
Our results show that local expression similarity clustering greatly enhances the overall quality of the derived rules, both in terms of enrichment of Gene Ontology functional annotation and coherence with ChIP-Chip binding data. Our approach thus provides reliable hypotheses on co-regulation that can be experimentally verified. An important feature of the method is its reliance only on widely accessible sequence and expression data. The same procedure can be easily applied to other microbial organisms.
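As an illustration of what enrichment of a Gene Ontology annotation in a rule's gene group can mean numerically, here is a minimal sketch of a one-sided hypergeometric test. The abstract does not state which statistical test was used, so the choice of the hypergeometric test is an assumption, and the function name `hypergeom_enrichment_p` and its parameters are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(
    population: int, annotated: int, selected: int, overlap: int
) -> float:
    """Probability of seeing at least `overlap` annotated genes by chance.

    population: total number of genes considered
    annotated:  genes in the population carrying the annotation term
    selected:   genes matched by the rule
    overlap:    matched genes that carry the annotation term
    """
    upper = min(annotated, selected)
    total = comb(population, selected)
    # Sum the upper tail of the hypergeometric distribution.
    tail = sum(
        comb(annotated, k) * comb(population - annotated, selected - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Toy numbers: 3 of 10 genes carry the term, and the rule selects exactly those 3.
p_value = hypergeom_enrichment_p(population=10, annotated=3, selected=3, overlap=3)
```

A small p-value indicates that the overlap between the rule's genes and the annotated genes is unlikely under random selection. When many rules and terms are tested, a multiple-testing correction would also be needed.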