A Monte Carlo-based framework enhances the discovery and interpretation of regulatory sequence motifs.

Affiliation

Department of Biomedical Engineering, One Shields Ave, University of California, Davis, CA 95616, USA.

Publication information

BMC Bioinformatics. 2012 Nov 27;13:317. doi: 10.1186/1471-2105-13-317.

Abstract

BACKGROUND

Discovering functionally significant, short, statistically overrepresented subsequence patterns (motifs) in a set of sequences is a challenging problem in bioinformatics. Often, not all sequences in the set contain a motif, and these non-motif-containing sequences complicate algorithmic motif discovery. Filtering them out of the larger set while simultaneously determining the identity of the motif is therefore a desirable but non-trivial goal in motif discovery research.

RESULTS

We describe MotifCatcher, a framework that extends the sensitivity of existing motif-finding tools by employing random sampling to effectively remove non-motif-containing sequences from the motif search. We developed two implementations of our algorithm, each built around a commonly used motif-finding tool, and applied the algorithm to three diverse chromatin immunoprecipitation (ChIP) data sets. In each case, the motif finder with the MotifCatcher extension demonstrated improved sensitivity over the motif finder alone. Our approach organizes the discovered candidate motifs of potential functional significance into a tree, which allowed us to gain additional insights. In all cases, we were able to support our findings with experimental work from the literature.
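The random-sampling idea described above can be illustrated with a small sketch. This is not the MotifCatcher implementation: the abstract does not state the subset size, the number of runs, or how runs are aggregated, so `subset_size`, `n_runs`, the majority vote, and the substring-based filter are all assumptions made for illustration. `toy_motif_finder` is a hypothetical stand-in for a real motif-finding tool and simply returns the k-mer present in the most sequences of a subset.

```python
import random
from collections import Counter


def toy_motif_finder(seqs, k):
    """Hypothetical stand-in for a real motif finder.

    Returns the k-mer that occurs in the largest number of sequences
    (ties broken alphabetically so the result is deterministic).
    """
    counts = Counter()
    for s in seqs:
        # Count each k-mer at most once per sequence.
        counts.update({s[i:i + k] for i in range(len(s) - k + 1)})
    return min(counts, key=lambda kmer: (-counts[kmer], kmer))


def monte_carlo_filter(seqs, k, subset_size, n_runs, seed=0):
    """Run the motif finder on many random subsets, then filter.

    Assumed aggregation: the motif returned most often across runs is
    taken as the consensus, and only sequences containing it are kept.
    """
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_runs):
        subset = rng.sample(seqs, subset_size)
        votes[toy_motif_finder(subset, k)] += 1
    consensus = votes.most_common(1)[0][0]
    kept = [s for s in seqs if consensus in s]
    return consensus, kept


if __name__ == "__main__":
    # Made-up example data: four sequences carry GATTACA, two do not.
    sequences = [
        "CAGATTACATG",
        "TCGATTACAGA",
        "AGGATTACACT",
        "GTGATTACAAC",
        "CCCCCCCCCCC",
        "AGAGAGAGAGA",
    ]
    motif, kept = monte_carlo_filter(sequences, k=7, subset_size=4, n_runs=50)
    print(motif, len(kept))
```

In practice the exact-substring filter would be replaced by whatever scoring the wrapped motif finder provides (for example, a position weight matrix match), since real binding sites are rarely identical across sequences.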

CONCLUSIONS

Our framework demonstrates that additional processing at the sequence entry level can significantly improve the performance of existing motif-finding tools. For each biological data set tested, we were able to propose novel biological hypotheses supported by experimental work from the literature. Specifically, in Escherichia coli, we suggested binding site motifs for 6 non-traditional LexA protein binding sites; in Saccharomyces cerevisiae, we hypothesized 2 disparate mechanisms for novel binding sites of the Cse4p protein; and in Halobacterium sp. NRC-1, we discovered subtle differences in a general transcription factor (GTF) binding site motif across several data sets. We suggest that small differences in our discovered motif could confer specificity for one or more homologous GTF proteins. We offer a free implementation of the MotifCatcher software package at http://www.bme.ucdavis.edu/facciotti/resources_data/software/.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4ab5/3542263/ddb71cf17993/1471-2105-13-317-1.jpg
