Zuo Chandler, Chen Kailei, Keleş Sündüz
Department of Statistics, Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, Wisconsin.
J Comput Biol. 2017 Jun;24(6):472-485. doi: 10.1089/cmb.2016.0138. Epub 2016 Nov 11.
Current analytic approaches for querying large collections of chromatin immunoprecipitation followed by sequencing (ChIP-seq) data from multiple cell types rely on analyzing each data set independently (i.e., peak calling). This approach ignores the fact that functional elements are frequently shared among related cell types and leads to overestimation of the extent of divergence between different ChIP-seq samples. Methods geared toward multisample investigations have limited applicability in settings that aim to integrate hundreds to thousands of ChIP-seq data sets for query loci (e.g., thousands of genomic loci with a specific binding site). Recently, Zuo et al. developed a hierarchical framework for state-space matrix inference and clustering, named MBASIC, to enable joint analysis of user-specified loci across multiple ChIP-seq data sets. Although this versatile framework both estimates the underlying state-space (e.g., bound vs. unbound) and groups loci with similar patterns together, its Expectation-Maximization-based estimation structure hinders its applicability with a large number of loci and samples. We address this limitation by developing a MAP-based asymptotic derivations from Bayes (MAD-Bayes) framework for MBASIC. This results in a K-means-like optimization algorithm that converges rapidly and hence enables exploring multiple initialization schemes and flexibility in tuning. Comparison with MBASIC indicates that this speed comes at a relatively insignificant loss in estimation accuracy. Although MAD-Bayes MBASIC is specifically designed for the analysis of user-specified loci, it is able to capture overall patterns of histone marks from multiple ChIP-seq data sets similar to those identified by genome-wide segmentation methods such as ChromHMM and Spectacle.
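The "K-means-like" character of MAD-Bayes algorithms can be illustrated with DP-means, a well-known small-variance asymptotic limit of a Dirichlet process Gaussian mixture. The sketch below is not the MAD-Bayes MBASIC algorithm, which jointly infers a state-space matrix and clusters loci; it only shows the general pattern of hard assignments, mean updates, and a penalty that controls how many clusters are created. The function names and the `penalty` parameter are illustrative assumptions, not part of the MBASIC software.

```python
from typing import List, Sequence, Tuple

Point = Sequence[float]


def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points: List[Point]) -> List[float]:
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def dp_means(
    data: List[Point], penalty: float, max_iter: int = 100
) -> Tuple[List[int], List[List[float]]]:
    """DP-means: K-means-style updates, opening a new cluster whenever a
    point is farther than `penalty` (squared distance) from every center."""
    centers = [mean(data)]          # start with a single global cluster
    labels = [0] * len(data)
    for _ in range(max_iter):
        changed = False
        # Assignment step: nearest center, or a new cluster if too far.
        for i, x in enumerate(data):
            dists = [sq_dist(x, c) for c in centers]
            k = min(range(len(centers)), key=dists.__getitem__)
            if dists[k] > penalty:
                centers.append(list(x))
                k = len(centers) - 1
            if labels[i] != k:
                labels[i] = k
                changed = True
        # Update step: drop empty clusters, relabel, recompute means.
        used = sorted(set(labels))
        remap = {old: new for new, old in enumerate(used)}
        labels = [remap[l] for l in labels]
        centers = [
            mean([x for x, l in zip(data, labels) if l == k])
            for k in range(len(used))
        ]
        if not changed:
            break
    return labels, centers
```

Because each iteration costs only a pass over the data, such an algorithm can be rerun cheaply from several starting points or with several penalty values, which is the kind of flexibility the abstract attributes to the MAD-Bayes approach.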