THiCweed: fast, sensitive detection of sequence features by clustering big datasets.

Author affiliations

Computational Biology Group, The Institute of Mathematical Sciences (HBNI), Chennai 600113, Tamil Nadu, India.

Chemical Engineering and Process Development Division, CSIR-National Chemical Laboratory, Pune 411008, Maharashtra, India.

Publication information

Nucleic Acids Res. 2018 Mar 16;46(5):e29. doi: 10.1093/nar/gkx1251.

PMID:29267972

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5861420/

Abstract

We present THiCweed, a new approach to analyzing transcription factor binding data from high-throughput chromatin immunoprecipitation-sequencing (ChIP-seq) experiments. THiCweed clusters bound regions with a divisive hierarchical approach based on sequence similarity within sliding windows, exploring both strands. THiCweed is specially geared toward data containing mixtures of motifs, which present a challenge to traditional motif-finders. Our implementation is significantly faster than standard motif-finding programs, able to process 30 000 peaks in 1-2 h on a single CPU core of a desktop computer. On synthetic data containing mixtures of motifs it is at least as accurate as all other tested programs. THiCweed performs best with large 'window' sizes (≥50 bp), much longer than typical binding sites (7-15 bp). On real data it recovers motifs reported in the literature, and also uncovers complex sequence characteristics in flanking DNA, variant motifs and secondary motifs even when they occur in <5% of the input, all of which appear biologically relevant. We also find recurring sequence patterns across diverse ChIP-seq datasets, possibly related to chromatin architecture and looping. THiCweed thus goes beyond traditional motif finding to give new insights into genomic transcription factor-binding complexity.
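To make the idea of clustering on sliding windows across both strands concrete, the following is a minimal sketch and not the THiCweed implementation. It assumes a simple position-count profile with a pseudocount as the similarity score and a 2-means-style reassignment loop for a single split; the function names (`candidate_windows`, `split_in_two`, and so on), the seed windows and the scoring rule are all hypothetical choices made for illustration. A full divisive hierarchy would apply such a split recursively, which is omitted here.

```python
import math
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


def candidate_windows(seq, width):
    """Yield (offset, strand, window) for every window on both strands."""
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(len(s) - width + 1):
            yield offset, strand, s[offset:offset + width]


def column_counts(windows, width):
    """Count bases per column over a list of equal-length windows."""
    counts = [Counter() for _ in range(width)]
    for w in windows:
        for i, base in enumerate(w):
            counts[i][base] += 1
    return counts


def log_score(window, counts, pseudocount=0.5):
    """Log-probability of a window under a column-count profile (assumed score)."""
    score = 0.0
    for i, base in enumerate(window):
        total = sum(counts[i].values()) + 4 * pseudocount
        score += math.log((counts[i][base] + pseudocount) / total)
    return score


def best_window(seq, counts, width):
    """Pick the best-scoring window of seq, searching both strands."""
    return max(candidate_windows(seq, width),
               key=lambda c: log_score(c[2], counts))


def split_in_two(seqs, width, seed_a, seed_b, rounds=5):
    """One divisive step: split seqs into two clusters by reassignment.

    seed_a and seed_b are hypothetical starting windows for the two profiles.
    """
    profiles = [column_counts([seed_a], width), column_counts([seed_b], width)]
    labels = []
    for _ in range(rounds):
        picks = []
        for seq in seqs:
            options = []
            for k, prof in enumerate(profiles):
                _, _, window = best_window(seq, prof, width)
                options.append((log_score(window, prof), k, window))
            _, k, window = max(options)  # cluster whose profile fits best
            picks.append((k, window))
        new_labels = [k for k, _ in picks]
        if new_labels == labels:  # assignments stable: stop early
            break
        labels = new_labels
        profiles = [column_counts([w for k, w in picks if k == j], width)
                    for j in (0, 1)]
    return labels


if __name__ == "__main__":
    # Toy sequences: two carry AAAC (one on the minus strand), two carry CGCC.
    seqs = ["TAAACT", "TGTTTA", "ACGCCA", "TGGCGA"]
    print(split_in_two(seqs, 4, "AAAC", "CGCC"))
```

In this toy setting each sequence is assigned to whichever cluster profile best explains its best window, so a site on the reverse strand is grouped with its forward-strand counterparts. The short window width is only for readability; the abstract reports that THiCweed itself works best with windows of ≥50 bp.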
