Inferring intra-motif dependencies of DNA binding sites from ChIP-seq data.

Author information

Eggeling Ralf, Roos Teemu, Myllymäki Petri, Grosse Ivo

Affiliations

Institute of Computer Science, Martin Luther University Halle-Wittenberg, Halle, Germany.

Department of Computer Science, Helsinki Institute for Information Technology HIIT, University of Helsinki, Helsinki, Finland.

Publication information

BMC Bioinformatics. 2015 Nov 9;16:375. doi: 10.1186/s12859-015-0797-4.

Abstract

BACKGROUND

Statistical modeling of transcription factor binding sites is one of the classical fields in bioinformatics. The position weight matrix (PWM) model, which assumes statistical independence among all nucleotides in a binding site, was the standard model for this task for more than three decades, but its simple assumptions are increasingly being called into question. Recent high-throughput sequencing methods have provided data sets of sufficient size and quality for studying the benefits of more complex models. However, learning more complex models typically entails the danger of overfitting. Although model classes that dynamically adapt the model complexity to the data have been developed, effective model selection has so far been possible only for fully observable data and not, for example, within de novo motif discovery.
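To make the independence assumption of the PWM model concrete, the following is a minimal Python sketch, not the authors' implementation. The function names (`fit_pwm`, `pwm_log_prob`), the pseudocount of 1.0, and the toy sites are all assumptions chosen for illustration. Each motif position gets its own nucleotide distribution, and the log-probability of a site is simply the sum of per-position terms.

```python
import math

ALPHABET = "ACGT"

def fit_pwm(sites, pseudocount=1.0):
    """Estimate one independent nucleotide distribution per motif position.

    `sites` is a list of equal-length, pre-aligned binding sites.
    The pseudocount is an illustrative choice, not taken from the paper.
    """
    width = len(sites[0])
    pwm = []
    for pos in range(width):
        counts = {base: pseudocount for base in ALPHABET}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({base: counts[base] / total for base in ALPHABET})
    return pwm

def pwm_log_prob(pwm, seq):
    """Log-probability of seq under the PWM.

    Independence across positions means the joint probability factorizes,
    so the log-probability is a plain sum over positions.
    """
    return sum(math.log(column[base]) for column, base in zip(pwm, seq))

# Hypothetical aligned sites, for illustration only.
toy_sites = ["ACGT", "ACGA", "TCGT", "ACCT"]
pwm = fit_pwm(toy_sites)
print(pwm_log_prob(pwm, "ACGT"), pwm_log_prob(pwm, "TTTT"))
```

Because every position is scored separately, a model of this form cannot express that a nucleotide at one position makes a particular nucleotide at another position more or less likely.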

RESULTS

To address this issue, we propose a stochastic algorithm for performing robust model selection in a latent variable setting. This algorithm yields a solution without relying on hyperparameter tuning via massive cross-validation or other computationally expensive resampling techniques. Using this algorithm for learning inhomogeneous parsimonious Markov models, we study the degree of putative higher-order intra-motif dependencies for transcription factor binding sites inferred via de novo motif discovery from ChIP-seq data. We find that intra-motif dependencies are prevalent and not limited to first-order dependencies among directly adjacent nucleotides; second-order models appear to be the significantly better choice.
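As an illustration of what a first-order dependency among directly adjacent nucleotides means, here is a minimal Python sketch of a plain inhomogeneous first-order Markov model. This is a deliberately simpler stand-in: it is neither the inhomogeneous parsimonious Markov model nor the stochastic model selection algorithm described above, and it assumes fully observed, pre-aligned sites rather than the latent variable setting of de novo motif discovery. The function names, the pseudocount, and the toy data are hypothetical.

```python
import math

ALPHABET = "ACGT"

def fit_first_order(sites, pseudocount=1.0):
    """Fit position-specific distributions P(x_i | x_{i-1}) from aligned sites."""
    width = len(sites[0])
    # Position 0 has no predecessor, so it gets an unconditional distribution.
    first = {base: pseudocount for base in ALPHABET}
    for site in sites:
        first[site[0]] += 1
    total = sum(first.values())
    first = {base: count / total for base, count in first.items()}

    # transitions[i - 1][prev][cur] holds P(x_i = cur | x_{i-1} = prev).
    transitions = []
    for pos in range(1, width):
        table = {prev: {cur: pseudocount for cur in ALPHABET} for prev in ALPHABET}
        for site in sites:
            table[site[pos - 1]][site[pos]] += 1
        for prev in ALPHABET:
            row_total = sum(table[prev].values())
            table[prev] = {cur: c / row_total for cur, c in table[prev].items()}
        transitions.append(table)
    return first, transitions

def first_order_log_prob(model, seq):
    """Log-probability of seq: each position is conditioned on its left neighbor."""
    first, transitions = model
    logp = math.log(first[seq[0]])
    for pos in range(1, len(seq)):
        logp += math.log(transitions[pos - 1][seq[pos - 1]][seq[pos]])
    return logp

# Hypothetical sites in which positions 2 and 3 always agree (AA or TT).
toy_sites = ["CAAG", "CTTG"] * 10
model = fit_first_order(toy_sites)
print(first_order_log_prob(model, "CAAG"), first_order_log_prob(model, "CATG"))
```

On this toy data, the model scores `CAAG` higher than the unseen mixture `CATG`. A PWM fitted to the same sites would score both identically, since A and T are equally frequent at each of the two middle positions. A second-order model would condition each position on its two predecessors in the same way.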

CONCLUSIONS

The traditional PWM model indeed appears to be insufficient for inferring realistic sequence motifs, as it is on average outperformed by more complex models that take intra-motif dependencies into account. Moreover, using such models together with an appropriate model selection procedure does not lead to a significant performance loss in comparison with the PWM model for any of the studied transcription factors. Hence, we recommend that any modern motif discovery algorithm attempt to take intra-motif dependencies into account.
