Wei Yingying, Wu George, Ji Hongkai
Department of Biostatistics, The Johns Hopkins Bloomberg School of Public Health, 615 North Wolfe Street, Baltimore, MD 21205 USA.
Stat Biosci. 2013 May;5(1):156-178. doi: 10.1007/s12561-012-9066-5. Epub 2012 May 23.
Mapping genome-wide binding sites of all transcription factors (TFs) in all biological contexts is a critical step toward understanding gene regulation. The state-of-the-art technologies for mapping transcription factor binding sites (TFBSs) couple chromatin immunoprecipitation (ChIP) with high-throughput sequencing (ChIP-seq) or tiling array hybridization (ChIP-chip). These technologies have a key limitation: they are low-throughput with respect to surveying many TFs. Recent advances in genome-wide chromatin profiling, including the development of technologies such as DNase-seq, FAIRE-seq and ChIP-seq for histone modifications, make it possible to predict in vivo TFBSs by analyzing chromatin features at computationally determined DNA motif sites. This promising new approach may allow researchers to monitor the genome-wide binding sites of many TFs simultaneously. In this article, we discuss various experimental design and data analysis issues that arise when applying this approach. Through a systematic analysis of data from the Encyclopedia of DNA Elements (ENCODE) project, we compare the predictive power of individual chromatin marks and of combinations of marks using supervised and unsupervised learning methods, and we evaluate the value of integrating information from public ChIP and gene expression data. We also highlight the challenges and opportunities for developing novel analytical methods, such as resolving the one-motif-multiple-TF ambiguity and distinguishing functional from non-functional TF binding targets among the predicted binding sites.
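The core idea described above, scoring motif sites by their chromatin signal and checking the ranking against ChIP-derived labels, can be illustrated with a minimal sketch. It is not the authors' method: the function names (`auc`, `zscores`, `combined_score`), the averaging of z-scores as the unsupervised combination, and the toy numbers are all assumptions made for illustration only.

```python
from typing import List, Sequence


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve via pairwise comparison; ties count as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one bound and one unbound motif site")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def zscores(values: Sequence[float]) -> List[float]:
    """Standardize one chromatin mark across all motif sites."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    sd = var ** 0.5 or 1.0  # fall back to 1.0 when the mark is constant
    return [(v - mean) / sd for v in values]


def combined_score(marks: Sequence[Sequence[float]]) -> List[float]:
    """Hypothetical unsupervised combination: mean z-score per motif site."""
    standardized = [zscores(m) for m in marks]
    return [sum(site) / len(site) for site in zip(*standardized)]


# Toy values, invented for illustration (not ENCODE data):
# one signal value per motif site for each chromatin mark.
dnase_signal = [12.0, 3.0, 9.0, 1.0, 7.0, 2.0]
histone_signal = [5.0, 6.0, 8.0, 1.0, 2.0, 3.0]
chip_bound = [1, 0, 1, 0, 1, 0]  # 1 = motif site overlaps a ChIP peak

if __name__ == "__main__":
    print("DNase only:", auc(dnase_signal, chip_bound))
    print("Histone mark only:", auc(histone_signal, chip_bound))
    both = combined_score([dnase_signal, histone_signal])
    print("Combined:", auc(both, chip_bound))
```

A supervised variant would instead fit a classifier on sites with known ChIP labels and evaluate it on held-out sites; the same `auc` function could be used to compare the two settings.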
The online version of this article (doi:10.1007/s12561-012-9066-5) contains supplementary material, which is available to authorized users.