Suppr超能文献

马尔可夫源生成的一组随机序列中模式的精确分布:在生物数据中的应用。

Exact distribution of a pattern in a set of random sequences generated by a Markov source: applications to biological data.

作者信息

Nuel Gregory, Regad Leslie, Martin Juliette, Camproux Anne-Claude

机构信息

LSG, Laboratoire Statistique et Génome, CNRS UMR-8071, INRA UMR-1152, University of Evry, Evry, France.

CNRS, Paris, France.

出版信息

Algorithms Mol Biol. 2010 Jan 26;5:15. doi: 10.1186/1748-7188-5-15.

Abstract

BACKGROUND

In bioinformatics it is common to search for a pattern of interest in a potentially large set of rather short sequences (upstream gene regions, proteins, exons, etc.). Although many methodological approaches allow practitioners to compute the distribution of a pattern count in a random sequence generated by a Markov source, no specific developments have taken into account the counting of occurrences in a set of independent sequences. We aim to address this problem by deriving efficient approaches and algorithms to perform these computations both for low and high complexity patterns in the framework of homogeneous or heterogeneous Markov models.

RESULTS

The latest advances in the field allowed us to use a technique of optimal Markov chain embedding based on deterministic finite automata to introduce three innovative algorithms. Algorithm 1 is the only one able to deal with heterogeneous models. It also permits to avoid any product of convolution of the pattern distribution in individual sequences. When working with homogeneous models, Algorithm 2 yields a dramatic reduction in the complexity by taking advantage of previous computations to obtain moment generating functions efficiently. In the particular case of low or moderate complexity patterns, Algorithm 3 exploits power computation and binary decomposition to further reduce the time complexity to a logarithmic scale. All these algorithms and their relative interest in comparison with existing ones were then tested and discussed on a toy-example and three biological data sets: structural patterns in protein loop structures, PROSITE signatures in a bacterial proteome, and transcription factors in upstream gene regions. On these data sets, we also compared our exact approaches to the tempting approximation that consists in concatenating the sequences in the data set into a single sequence.

CONCLUSIONS

Our algorithms prove to be effective and able to handle real data sets with multiple sequences, as well as biological patterns of interest, even when the latter display a high complexity (PROSITE signatures for example). In addition, these exact algorithms allow us to avoid the edge effect observed under the single sequence approximation, which leads to erroneous results, especially when the marginal distribution of the model displays a slow convergence toward the stationary distribution. We end up with a discussion on our method and on its potential improvements.

摘要

背景

在生物信息学中,通常会在潜在的大量较短序列(上游基因区域、蛋白质、外显子等)中搜索感兴趣的模式。尽管许多方法允许从业者计算马尔可夫源生成的随机序列中模式计数的分布,但没有具体的进展考虑到一组独立序列中出现次数的计数。我们旨在通过推导高效的方法和算法来解决这个问题,以便在齐次或非齐次马尔可夫模型框架下对低复杂度和高复杂度模式进行这些计算。

结果

该领域的最新进展使我们能够使用基于确定性有限自动机的最优马尔可夫链嵌入技术引入三种创新算法。算法1是唯一能够处理非齐次模型的算法。它还允许避免单个序列中模式分布的卷积的任何乘积。在处理齐次模型时,算法2通过利用先前的计算来有效获得矩生成函数,从而显著降低了复杂度。在低复杂度或中等复杂度模式的特殊情况下,算法3利用幂运算和二进制分解将时间复杂度进一步降低到对数尺度。然后,在一个简单示例和三个生物数据集上对所有这些算法及其与现有算法相比的相对优势进行了测试和讨论:蛋白质环结构中的结构模式、细菌蛋白质组中的PROSITE签名以及上游基因区域中的转录因子。在这些数据集上,我们还将我们的精确方法与将数据集中的序列连接成单个序列的诱人近似方法进行了比较。

结论

我们的算法被证明是有效的,能够处理具有多个序列的真实数据集以及感兴趣的生物模式,即使后者显示出高复杂度(例如PROSITE签名)。此外,这些精确算法使我们能够避免在单序列近似下观察到的边缘效应,这种效应会导致错误结果,特别是当模型的边际分布向平稳分布的收敛缓慢时。最后,我们对我们的方法及其潜在改进进行了讨论。

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/fc8d/2828453/44649feccbe1/1748-7188-5-15-1.jpg

文献AI研究员

20分钟写一篇综述,助力文献阅读效率提升50倍。

立即体验

用中文搜PubMed

大模型驱动的PubMed中文搜索引擎

马上搜索

文档翻译

学术文献翻译模型,支持多种主流文档格式。

立即体验