Xing Haipeng, Liao Willey, Mo Yifan, Zhang Michael Q
Department of Applied Mathematics & Statistics, Stony Brook University.
J Vis Exp. 2012 Dec 10;(70):e4273. doi: 10.3791/4273.
ChIPseq is a widely used technique for investigating protein-DNA interactions. Read density profiles are generated by next-generation sequencing of protein-bound DNA and alignment of the short reads to a reference genome. Enriched regions are revealed as peaks, which often differ dramatically in shape depending on the target protein(1). For example, transcription factors often bind in a site- and sequence-specific manner and tend to produce punctate peaks, while histone modifications are more pervasive and are characterized by broad, diffuse islands of enrichment(2). Reliably identifying these regions was the focus of our work. Algorithms for analyzing ChIPseq data have employed various methodologies, from heuristics(3-5) to more rigorous statistical models, e.g. Hidden Markov Models (HMMs)(6-8). We sought a solution that minimized the need for difficult-to-define, ad hoc parameters, which often compromise resolution and lessen the intuitive usability of the tool. With respect to HMM-based methods, we aimed to curtail the parameter estimation procedures and the simple, finite-state classifications that are often used. Additionally, conventional ChIPseq data analysis involves categorizing the expected read density profiles as either punctate or diffuse and then applying the appropriate tool. We further aimed to replace these two distinct models with a single, more versatile model that can address the entire spectrum of data types. To meet these objectives, we first constructed a statistical framework that naturally models ChIPseq data structures using a cutting-edge advance in HMMs(9) that relies only on explicit formulas, an innovation crucial to its performance advantages. More sophisticated than heuristic models, our HMM accommodates infinite hidden states through a Bayesian model. We applied it to identify reasonable change points in read density, which in turn define segments of enrichment.
Our analysis revealed that our Bayesian Change Point (BCP) algorithm has reduced computational complexity, as evidenced by a shorter run time and a smaller memory footprint. The BCP algorithm was successfully applied to both punctate peak and diffuse island identification with robust accuracy and limited user-defined parameters, which illustrates both its versatility and its ease of use. Consequently, we believe it can be implemented readily across broad ranges of data types and end users in a manner that is easily compared and contrasted, making it a useful tool for ChIPseq data analysis that can aid collaboration and corroboration between research groups. Here, we demonstrate the application of BCP to existing transcription factor(10,11) and epigenetic data(12) to illustrate its usefulness.
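To make the idea of a change point in read density concrete, the following is a minimal sketch and not the BCP implementation. It assumes binned read counts that are Poisson distributed with a conjugate Gamma prior on the rate, and it considers only a single change point, whereas the model described above accommodates infinite hidden states. The function names, the prior parameters `a` and `b`, and the toy counts are all hypothetical and chosen purely for illustration. The one point of contact with the text is that the posterior here is computed from explicit, closed-form expressions.

```python
import math


def log_marginal(total, n, a=1.0, b=1.0):
    """Log marginal likelihood of n Poisson counts that sum to `total`,
    with a Gamma(shape=a, rate=b) prior on the Poisson rate.

    The product of 1/x_i! terms is omitted because it is the same for
    every candidate split and cancels when the posterior is normalized.
    """
    return (a * math.log(b) - math.lgamma(a)
            + math.lgamma(a + total) - (a + total) * math.log(b + n))


def change_point_posterior(counts, a=1.0, b=1.0):
    """Posterior probability that a single change point lies after bin k,
    for k = 1 .. len(counts) - 1, under a uniform prior on k.
    (Illustrative assumption: exactly one change point exists.)"""
    n = len(counts)
    if n < 2:
        raise ValueError("need at least two bins")
    total = sum(counts)
    log_post = []
    left = 0
    for k in range(1, n):
        left += counts[k - 1]
        # Bins before and after the split get independent Poisson rates.
        log_post.append(log_marginal(left, k, a, b)
                        + log_marginal(total - left, n - k, a, b))
    # Normalize in log space to avoid overflow.
    peak = max(log_post)
    weights = [math.exp(v - peak) for v in log_post]
    norm = sum(weights)
    return [w / norm for w in weights]


# Toy read counts per bin (invented): background, then an enriched segment.
counts = [2, 1, 3, 2, 1, 2, 14, 17, 15, 16, 13, 18]
posterior = change_point_posterior(counts)
# Most probable number of bins before the change point.
best_k = max(range(len(posterior)), key=posterior.__getitem__) + 1
print(best_k, round(posterior[best_k - 1], 3))
```

On these toy counts the posterior concentrates at the boundary between the low-count and high-count bins, which is the sense in which change points in read density delimit segments of enrichment.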