Inferring intra-motif dependencies of DNA binding sites from ChIP-seq data.

Authors

Eggeling Ralf, Roos Teemu, Myllymäki Petri, Grosse Ivo

Affiliations

Institute of Computer Science, Martin Luther University Halle-Wittenberg, Halle, Germany.

Department of Computer Science, Helsinki Institute for Information Technology HIIT, University of Helsinki, Helsinki, Finland.

Publication information

BMC Bioinformatics. 2015 Nov 9;16:375. doi: 10.1186/s12859-015-0797-4.

DOI:10.1186/s12859-015-0797-4

PMID:26552868

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC4640111/

Abstract

BACKGROUND

Statistical modeling of transcription factor binding sites is one of the classical fields in bioinformatics. The position weight matrix (PWM) model, which assumes statistical independence among all nucleotides in a binding site, was the standard model for this task for more than three decades, but its simple assumptions are increasingly called into question. Recent high-throughput sequencing methods have provided data sets of sufficient size and quality for studying the benefits of more complex models. However, learning more complex models typically entails the danger of overfitting. While model classes that dynamically adapt the model complexity to the data have been developed, effective model selection is to date only possible for fully observable data, and not, for example, within de novo motif discovery.
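To make the independence assumption concrete, the following is a minimal sketch of a PWM, not code from the paper. The function names `fit_pwm` and `pwm_log_prob`, the pseudocount value, and the toy sites are all hypothetical choices made for illustration.

```python
import math

ALPHABET = "ACGT"

def fit_pwm(sites, pseudocount=1.0):
    """Estimate one nucleotide distribution per motif position."""
    width = len(sites[0])
    pwm = []
    for j in range(width):
        # Start every count at the pseudocount to avoid zero probabilities.
        counts = {b: pseudocount for b in ALPHABET}
        for site in sites:
            counts[site[j]] += 1
        total = sum(counts.values())
        pwm.append({b: counts[b] / total for b in ALPHABET})
    return pwm

def pwm_log_prob(pwm, seq):
    """Independence assumption: the log-probability is a sum over positions."""
    return sum(math.log(pwm[j][base]) for j, base in enumerate(seq))

# Hypothetical toy binding sites, not data from the study.
toy_sites = ["ACGT", "ACGA", "TCGT", "ACCT"]
pwm = fit_pwm(toy_sites)
```

Because each position contributes its own term, a PWM cannot express that a nucleotide at one position makes a particular nucleotide at another position more or less likely.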

RESULTS

To address this issue, we propose a stochastic algorithm for performing robust model selection in a latent variable setting. This algorithm yields a solution without relying on hyperparameter tuning via massive cross-validation or other computationally expensive resampling techniques. Using this algorithm for learning inhomogeneous parsimonious Markov models, we study the degree of putative higher-order intra-motif dependencies for transcription factor binding sites inferred via de novo motif discovery from ChIP-seq data. We find that intra-motif dependencies are prevalent and not limited to first-order dependencies among directly adjacent nucleotides; second-order models appear to be the significantly better choice.
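As a rough illustration of what an intra-motif dependency is, the sketch below fits a plain fixed-order inhomogeneous Markov model on fully observed toy sites. It is a deliberate simplification: it does not implement parsimonious context trees, the latent-variable setting, or the stochastic model selection algorithm described above. All names, the pseudocount, and the toy data are hypothetical.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"

def fit_inhomogeneous_markov(sites, order=1, pseudocount=0.1):
    """Per position, estimate P(base | up to `order` preceding bases)."""
    width = len(sites[0])
    model = []
    for j in range(width):
        counts = defaultdict(lambda: {b: pseudocount for b in ALPHABET})
        for site in sites:
            context = site[max(0, j - order):j]  # empty string when order == 0
            counts[context][site[j]] += 1
        table = {}
        for context, c in counts.items():
            total = sum(c.values())
            table[context] = {b: c[b] / total for b in ALPHABET}
        model.append(table)
    return model

def log_prob(model, seq, order=1):
    """Log-probability of `seq`; unseen contexts fall back to uniform."""
    total = 0.0
    for j, base in enumerate(seq):
        context = seq[max(0, j - order):j]
        dist = model[j].get(context)
        total += math.log(dist[base]) if dist else math.log(0.25)
    return total

# Hypothetical toy sites with a perfect dependency between the two positions.
toy_sites = ["AA"] * 10 + ["TT"] * 10
m0 = fit_inhomogeneous_markov(toy_sites, order=0)  # equivalent to a PWM
m1 = fit_inhomogeneous_markov(toy_sites, order=1)
```

On this toy input the order-0 model scores "AA" and "AT" identically, while the order-1 model prefers "AA", which is the kind of signal a PWM cannot represent. Higher fixed orders multiply the number of parameters per position, which is where the overfitting risk and the need for model selection come from.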

CONCLUSIONS

The traditional PWM model indeed appears to be insufficient for inferring realistic sequence motifs, as it is on average outperformed by more complex models that take intra-motif dependencies into account. Moreover, using such models together with an appropriate model selection procedure does not lead to a significant performance loss in comparison with the PWM model for any of the studied transcription factors. Hence, we recommend that any modern motif discovery algorithm attempt to take intra-motif dependencies into account.
