Computational Biology Group, The Institute of Mathematical Sciences (HBNI), Chennai 600113, Tamil Nadu, India.
Chemical Engineering and Process Development Division, CSIR-National Chemical Laboratory, Pune 411008, Maharashtra, India.
Nucleic Acids Res. 2018 Mar 16;46(5):e29. doi: 10.1093/nar/gkx1251.
We present THiCweed, a new approach to analyzing transcription factor binding data from high-throughput chromatin immunoprecipitation-sequencing (ChIP-seq) experiments. THiCweed clusters bound regions using a divisive hierarchical approach based on sequence similarity within sliding windows, while exploring both strands. THiCweed is specially geared toward data containing mixtures of motifs, which present a challenge to traditional motif-finders. Our implementation is significantly faster than standard motif-finding programs, able to process 30 000 peaks in 1-2 h on a single CPU core of a desktop computer. On synthetic data containing mixtures of motifs, it is as accurate as, or more accurate than, all other tested programs. THiCweed performs best with large 'window' sizes (≥50 bp), much longer than typical binding sites (7-15 bp). On real data it not only recovers literature motifs but also uncovers complex sequence characteristics in flanking DNA, variant motifs and secondary motifs, even when they occur in <5% of the input, all of which appear biologically relevant. We also find recurring sequence patterns across diverse ChIP-seq datasets, possibly related to chromatin architecture and looping. THiCweed thus goes beyond traditional motif finding to give new insights into genomic transcription factor-binding complexity.
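The idea of comparing sequences "within sliding windows, while exploring both strands" can be illustrated with a minimal sketch. This is not the THiCweed implementation: the function names (`build_profile`, `best_placement`), the pseudocount and the log-likelihood score are all assumptions chosen for illustration, and the abstract does not specify the actual scoring or splitting criteria. The sketch only shows how one sequence might be placed against a cluster's position-specific base frequencies by trying every window offset on both strands.

```python
import math

# Illustrative only: assumes uppercase sequences over the alphabet A, C, G, T.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def build_profile(windows, pseudocount=0.5):
    """Column-wise base frequencies for a list of equal-length windows.

    The pseudocount (a hypothetical choice) keeps every probability non-zero.
    """
    profile = []
    for i in range(len(windows[0])):
        counts = {base: pseudocount for base in "ACGT"}
        for window in windows:
            counts[window[i]] += 1
        total = sum(counts.values())
        profile.append({base: n / total for base, n in counts.items()})
    return profile


def log_likelihood(window, profile):
    """Log-probability of a window under the profile's independent columns."""
    return sum(math.log(column[base]) for base, column in zip(window, profile))


def best_placement(seq, profile):
    """Slide a window over both strands; return (score, offset, strand).

    The offset is measured on the strand that was scanned, so for "-" it
    refers to a position in the reverse complement of seq.
    """
    width = len(profile)
    best = None
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(len(s) - width + 1):
            score = log_likelihood(s[offset:offset + width], profile)
            if best is None or score > best[0]:
                best = (score, offset, strand)
    return best


# Usage: a cluster built from copies of a toy motif, and a sequence that
# carries the motif on either the forward or the reverse strand.
profile = build_profile(["GATTACA"] * 5)
forward = "CCCGATTACACC"
reverse = reverse_complement(forward)
placement_fwd = best_placement(forward, profile)
placement_rev = best_placement(reverse, profile)
```

In a divisive clustering scheme, a placement score of this kind could be used to decide which of two candidate sub-clusters a sequence fits better, and at what shift and orientation; how THiCweed actually makes that decision is described in the paper, not here.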