Exact distribution of a pattern in a set of random sequences generated by a Markov source: applications to biological data.

Authors

Nuel Gregory, Regad Leslie, Martin Juliette, Camproux Anne-Claude

Affiliations

LSG, Laboratoire Statistique et Génome, CNRS UMR-8071, INRA UMR-1152, University of Evry, Evry, France.

CNRS, Paris, France.

Publication information

Algorithms Mol Biol. 2010 Jan 26;5:15. doi: 10.1186/1748-7188-5-15.

DOI:10.1186/1748-7188-5-15

PMID:20205909

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC2828453/

Abstract

BACKGROUND

In bioinformatics it is common to search for a pattern of interest in a potentially large set of rather short sequences (upstream gene regions, proteins, exons, etc.). Although many methodological approaches allow practitioners to compute the distribution of a pattern count in a random sequence generated by a Markov source, none has specifically addressed counting occurrences in a set of independent sequences. We aim to address this problem by deriving efficient approaches and algorithms to perform these computations for both low- and high-complexity patterns in the framework of homogeneous or heterogeneous Markov models.

RESULTS

The latest advances in the field allowed us to use a technique of optimal Markov chain embedding based on deterministic finite automata to introduce three innovative algorithms. Algorithm 1 is the only one able to deal with heterogeneous models. It also avoids any convolution product of the pattern distributions in individual sequences. When working with homogeneous models, Algorithm 2 dramatically reduces the complexity by reusing previous computations to obtain moment generating functions efficiently. In the particular case of low- or moderate-complexity patterns, Algorithm 3 exploits power computation and binary decomposition to further reduce the time complexity to a logarithmic scale. All these algorithms, and their relative merits compared with existing ones, were then tested and discussed on a toy example and three biological data sets: structural patterns in protein loop structures, PROSITE signatures in a bacterial proteome, and transcription factors in upstream gene regions. On these data sets, we also compared our exact approaches to the tempting approximation that consists of concatenating the sequences in the data set into a single sequence.
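To make the idea of Markov chain embedding with a deterministic finite automaton more concrete, the following is a minimal sketch and not a reproduction of Algorithms 1 to 3. It assumes a single word as the pattern, overlapping occurrences, and a homogeneous order-1 Markov model. It combines the per-sequence distributions by explicit convolution, which is the baseline that Algorithm 1 is said to avoid. All function names (`build_dfa`, `count_distribution`, `convolve`, `count_distribution_in_set`) and the toy parameters are hypothetical.

```python
def build_dfa(word, alphabet):
    """KMP-style DFA: state q is the length of the longest prefix of `word`
    that is a suffix of the text read so far (q == len(word) means a match)."""
    m = len(word)
    delta = [dict() for _ in range(m + 1)]
    for q in range(m + 1):
        for a in alphabet:
            s = word[:q] + a
            k = min(m, len(s))
            while k > 0 and s[-k:] != word[:k]:
                k -= 1
            delta[q][a] = k
    return delta


def count_distribution(word, n, alphabet, init, trans):
    """Exact pmf of the number of overlapping occurrences of `word` in one
    sequence of length n >= 1 drawn from an order-1 Markov chain."""
    delta, m = build_dfa(word, alphabet), len(word)
    # Embedded state: (DFA state, last letter, occurrences so far) -> probability
    cur = {}
    for a in alphabet:
        q = delta[0][a]
        key = (q, a, int(q == m))
        cur[key] = cur.get(key, 0.0) + init[a]
    for _ in range(n - 1):
        nxt = {}
        for (q, a, c), p in cur.items():
            for b in alphabet:
                q2 = delta[q][b]
                key = (q2, b, c + int(q2 == m))
                nxt[key] = nxt.get(key, 0.0) + p * trans[a][b]
        cur = nxt
    pmf = [0.0] * (max(c for (_, _, c) in cur) + 1)
    for (_, _, c), p in cur.items():
        pmf[c] += p
    return pmf


def convolve(p, q):
    """pmf of the sum of two independent counts."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out


def count_distribution_in_set(word, lengths, alphabet, init, trans):
    """pmf of the total count over independent sequences of the given lengths."""
    total = [1.0]
    for n in lengths:
        total = convolve(total, count_distribution(word, n, alphabet, init, trans))
    return total


# Toy parameters (assumed for illustration): uniform letters on {a, b}.
alphabet = "ab"
init = {"a": 0.5, "b": 0.5}
trans = {"a": {"a": 0.5, "b": 0.5}, "b": {"a": 0.5, "b": 0.5}}
pmf_one = count_distribution("ab", 2, alphabet, init, trans)
pmf_set = count_distribution_in_set("ab", [2, 2], alphabet, init, trans)
```

In this toy case a single sequence of length 2 contains "ab" with probability 0.25, so `pmf_set` gives the distribution of the total count over two such sequences. The cost of the explicit convolution grows with the number of sequences and the range of counts, which is the overhead the algorithms described above aim to reduce.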

CONCLUSIONS

Our algorithms prove to be effective and able to handle real data sets with multiple sequences, as well as biological patterns of interest, even when the latter display a high complexity (PROSITE signatures, for example). In addition, these exact algorithms allow us to avoid the edge effect observed under the single-sequence approximation, which leads to erroneous results, especially when the marginal distribution of the model converges slowly toward the stationary distribution. We conclude with a discussion of our method and its potential improvements.
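The remark about slow convergence can be illustrated with a small sketch. It assumes a two-state chain with hypothetical switching probabilities `a` and `b`, and tracks how far the marginal distribution at each position lies from the stationary distribution, measured in total variation distance. One plausible reading of the edge effect is that each sequence in a set restarts from the initial distribution, whereas in a concatenated sequence only the first one does, so the two models differ most when this distance decays slowly. The helper names below are illustrative and not taken from the paper.

```python
def step(mu, P):
    """One step of the chain: returns the row vector mu * P."""
    k = len(mu)
    return [sum(mu[i] * P[i][j] for i in range(k)) for j in range(k)]


def tv_to_stationary(mu0, P, pi, n):
    """Total variation distance between the marginal distribution and pi
    at each of the first n positions, starting from mu0."""
    out, mu = [], list(mu0)
    for _ in range(n):
        out.append(0.5 * sum(abs(m - s) for m, s in zip(mu, pi)))
        mu = step(mu, P)
    return out


def two_state(a, b):
    """Two-state chain with switching probabilities a and b (assumed values),
    together with its stationary distribution."""
    P = [[1 - a, a], [b, 1 - b]]
    pi = [b / (a + b), a / (a + b)]
    return P, pi


mu0 = [1.0, 0.0]                          # every sequence starts in state 0
P_slow, pi_slow = two_state(0.05, 0.05)   # second eigenvalue 0.9
P_fast, pi_fast = two_state(0.5, 0.5)     # second eigenvalue 0.0
tv_slow = tv_to_stationary(mu0, P_slow, pi_slow, 20)
tv_fast = tv_to_stationary(mu0, P_fast, pi_fast, 20)
```

For a two-state chain the distance shrinks by a factor of |1 - a - b| per position, so the slow chain is still far from stationarity after ten positions while the fast chain reaches it after one step. With short sequences, the positions near each sequence start make up a large share of the data, which is where a single-sequence approximation would be expected to deviate.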
