Wong Ka-Chun, Li Yue, Peng Chengbin, Zhang Zhaolei
Department of Computer Science and Terrence Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, Ontario, Canada; CEMSE Division, King Abdullah University of Science and Technology, Thuwal, Jeddah, K.S.A.; Banting and Best Department of Medical Research and Department of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada.
Bioinformatics. 2015 Jan 1;31(1):17-24. doi: 10.1093/bioinformatics/btu604. Epub 2014 Sep 5.
Chromatin immunoprecipitation (ChIP) followed by high-throughput sequencing (ChIP-Seq) measures the genome-wide occupancy of transcription factors in vivo. Different combinations of DNA-binding protein occupancies may result in a gene being expressed in different tissues or at different developmental stages. To fully understand the functions of genes, it is essential to develop probabilistic models of multiple ChIP-Seq profiles that decipher the combinatorial regulatory mechanisms of multiple transcription factors.
In this work, we describe a probabilistic model (SignalSpider) to decipher the combinatorial binding events of multiple transcription factors. Compared with similar existing methods, SignalSpider performs better in clustering promoter and enhancer regions. Notably, SignalSpider can learn higher-order combinatorial patterns from multiple ChIP-Seq profiles. We applied SignalSpider to the normalized ChIP-Seq profiles from the ENCODE consortium and learned model instances. We observed different higher-order enrichment and depletion patterns across sets of proteins. These clustering patterns are supported by Gene Ontology (GO) enrichment, evolutionary conservation and chromatin interaction enrichment, offering biological insights for further focused studies. We also propose a specific enrichment map visualization method to reveal genome-wide transcription factor combinatorial patterns from the models built, extending our existing fine-scale knowledge of gene regulation to a genome-wide level.
The matrix-algebra-optimized executables and source code are available at the authors' website: http://www.cs.toronto.edu/~wkc/SignalSpider.