Combining sequence and time series expression data to learn transcriptional modules.

Authors

Kundaje Anshul, Middendorf Manuel, Gao Feng, Wiggins Chris, Leslie Christina

Affiliation

Department of Computer Science, Columbia University, New York 10027, USA.

Publication

IEEE/ACM Trans Comput Biol Bioinform. 2005 Jul-Sep;2(3):194-202. doi: 10.1109/TCBB.2005.34.

Abstract

Our goal is to cluster genes into transcriptional modules: sets of genes whose similarity in expression is explained by common regulatory mechanisms at the transcriptional level. We want to learn modules from both time series gene expression data and genome-wide motif data, which are now readily available for organisms such as S. cerevisiae as a result of prior computational studies or experimental results. We present a generative probabilistic model for combining regulatory sequence and time series expression data to cluster genes into coherent transcriptional modules. Starting with a set of motifs representing known or putative regulatory elements (transcription factor binding sites) and the counts of occurrences of these motifs in each gene's promoter region, together with a time series expression profile for each gene, the learning algorithm uses expectation maximization to learn module assignments based on both types of data. We also present a technique based on the Jensen-Shannon entropy contributions of motifs in the learned model for associating the most significant motifs with each module. Thus, the algorithm gives a global approach for associating sets of regulatory elements with "modules" of genes that have similar time series expression profiles. The model for expression data exploits our prior belief of smooth dependence on time by using statistical splines and is suitable for typical time course data sets with relatively few experiments. Moreover, the model is sufficiently interpretable that we can understand how both sequence data and expression data contribute to the cluster assignments, and how to interpolate between the two data sources. We present experimental results on the yeast cell cycle to validate our method and find that our combined expression and motif clustering algorithm discovers modules with both coherent expression and similar motif patterns, including binding motifs associated with known cell cycle transcription factors.
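The idea of clustering genes with expectation maximization over both expression profiles and motif counts can be illustrated with a small sketch. This is not the authors' model. It assumes, purely for illustration, a per-module Gaussian over raw time points (in place of the spline-based expression model the abstract describes), a per-module multinomial over motif counts, and a hypothetical weight `alpha` that trades off the two data sources. All function names, the toy data, and the initialization scheme are invented for this example.

```python
import math

def log_gauss(x, mean, var):
    """Log density of an isotropic Gaussian (assumed stand-in for a spline model)."""
    return sum(-0.5 * math.log(2 * math.pi * var) - (xi - mi) ** 2 / (2 * var)
               for xi, mi in zip(x, mean))

def log_multinomial(counts, probs):
    """Multinomial log-likelihood of motif counts, up to a module-independent constant."""
    return sum(c * math.log(p) for c, p in zip(counts, probs))

def e_step(expr, motifs, params, alpha):
    """Posterior module responsibilities for every gene."""
    resp = []
    for x, c in zip(expr, motifs):
        logs = [math.log(pi)
                + alpha * log_gauss(x, mean, var)               # expression term
                + (1.0 - alpha) * log_multinomial(c, theta)     # motif term
                for pi, mean, var, theta in params]
        top = max(logs)                      # subtract the max for numerical stability
        w = [math.exp(v - top) for v in logs]
        total = sum(w)
        resp.append([wi / total for wi in w])
    return resp

def m_step(expr, motifs, resp, k, pseudo=1.0, min_var=1e-3):
    """Re-estimate mixing weight, mean profile, variance, and motif distribution."""
    n, t, m = len(expr), len(expr[0]), len(motifs[0])
    params = []
    for j in range(k):
        w = [r[j] for r in resp]
        tot = sum(w) + 1e-12
        mean = [sum(wi * x[d] for wi, x in zip(w, expr)) / tot for d in range(t)]
        sq = sum(wi * sum((xd - md) ** 2 for xd, md in zip(x, mean))
                 for wi, x in zip(w, expr))
        var = max(sq / (tot * t), min_var)   # floor keeps the variance positive
        raw = [pseudo + sum(wi * c[d] for wi, c in zip(w, motifs)) for d in range(m)]
        z = sum(raw)
        params.append((tot / n, mean, var, [v / z for v in raw]))
    return params

def fit(expr, motifs, k, alpha=0.5, iters=50):
    """Run EM from a deterministic round-robin initial assignment."""
    resp = [[1.0 if j == i % k else 0.0 for j in range(k)] for i in range(len(expr))]
    for _ in range(iters):
        params = m_step(expr, motifs, resp, k)
        resp = e_step(expr, motifs, params, alpha)
    labels = [max(range(k), key=lambda j: r[j]) for r in resp]
    return labels, params

# Toy data (invented): six genes, four time points, three motifs.
toy_expr = [[1.0, 2.0, 1.0, 0.0], [1.1, 1.9, 0.9, 0.1], [0.9, 2.1, 1.1, -0.1],
            [-1.0, -2.0, -1.0, 0.0], [-1.1, -1.9, -0.9, 0.1], [-0.9, -2.1, -1.1, -0.1]]
toy_motifs = [[3, 0, 1], [4, 0, 0], [3, 1, 0], [0, 3, 1], [0, 4, 0], [1, 3, 0]]

toy_labels, toy_params = fit(toy_expr, toy_motifs, k=2)
```

In this sketch, `alpha` near 1 lets the expression profiles dominate the assignments and `alpha` near 0 lets the motif counts dominate. Weighting the two log-likelihoods this way is a heuristic chosen for the example; the abstract does not specify how the published model interpolates between the two data sources.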

