A reexamination of information theory-based methods for DNA-binding site identification.

Author information

Erill Ivan, O'Neill Michael C

Affiliation

Department of Biological Sciences, University of Maryland-Baltimore County, Baltimore, MD, USA.

Publication information

BMC Bioinformatics. 2009 Feb 11;10:57. doi: 10.1186/1471-2105-10-57.

Abstract

BACKGROUND

Searching for transcription factor binding sites in genome sequences is still an open problem in bioinformatics. Despite substantial progress, search methods based on information theory remain a standard in the field, even though the full validity of their underlying assumptions has only been tested in artificial settings. Here we use newly available data on transcription factors from different bacterial genomes to make a more thorough assessment of information theory-based search methods.

RESULTS

Our results reveal that conventional benchmarking against artificial sequence data frequently leads to overestimation of search efficiency. In addition, we find that sequence information by itself is often inadequate and must therefore be complemented by other cues, such as curvature, in real genomes. Furthermore, results on skewed genomes show that methods integrating skew information, such as Relative Entropy, are not effective because their assumptions may not hold in real genomes. The evidence suggests that binding sites tend to evolve towards genomic skew, rather than against it, and to maintain their information content through increased conservation. Based on these results, we identify several misconceptions about information theory as applied to binding sites, such as negative entropy, and we propose a revised paradigm to explain the observed results.
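To make the two kinds of measure concrete, the following is a minimal sketch, not the authors' implementation. It assumes the common textbook definitions: per-column information content as background entropy minus column entropy, and Relative Entropy as the Kullback-Leibler divergence of each column from the background base frequencies. The function names, the toy sites, and the AT-rich background are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

BASES = "ACGT"

def column_frequencies(sites):
    """Per-position base frequencies for a list of equal-length sites."""
    length = len(sites[0])
    freqs = []
    for i in range(length):
        counts = Counter(site[i] for site in sites)
        freqs.append({b: counts.get(b, 0) / len(sites) for b in BASES})
    return freqs

def entropy(dist):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def information_content(sites, background):
    """Sum over columns of H(background) - H(column), in bits."""
    h_bg = entropy(background)
    return sum(h_bg - entropy(col) for col in column_frequencies(sites))

def relative_entropy(sites, background):
    """Sum over columns of KL(column || background), in bits."""
    total = 0.0
    for col in column_frequencies(sites):
        total += sum(p * math.log2(p / background[b])
                     for b, p in col.items() if p > 0)
    return total

# Hypothetical backgrounds: a uniform genome and an AT-rich (skewed) one.
uniform = {b: 0.25 for b in BASES}
at_rich = {"A": 0.4, "T": 0.4, "C": 0.1, "G": 0.1}

with_skew = ["AAAA"] * 4      # fully conserved motif made of the common base
against_skew = ["CCCC"] * 4   # fully conserved motif made of a rare base

for name, sites in (("with skew", with_skew), ("against skew", against_skew)):
    print(name,
          round(information_content(sites, at_rich), 3),
          round(relative_entropy(sites, at_rich), 3))
```

Under these assumptions, the two toy motifs get the same information content on the skewed background, while Relative Entropy scores the motif built from the rare base much higher. In other words, this form of the measure rewards sites that run against the genomic skew, which is the assumption the abstract reports as not holding in real genomes.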

CONCLUSION

We conclude that, among information theory-based methods, the most unassuming search methods perform better, on average, than the alternatives, since heuristic corrections to these methods are prone to fail when working on real data. A reexamination of information content in binding sites reveals that information content is a compound measure of search and binding affinity requirements, a fact that has important repercussions for our understanding of binding site evolution.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/63d0/2680408/e3e9ba153c62/1471-2105-10-57-1.jpg
