• 文献检索
  • 文档翻译
  • 深度研究
  • 学术资讯
  • Suppr Zotero 插件Zotero 插件
  • 邀请有礼
  • 套餐&价格
  • 历史记录
应用&插件
Suppr Zotero 插件Zotero 插件浏览器插件Mac 客户端Windows 客户端微信小程序
定价
高级版会员购买积分包购买API积分包
服务
文献检索文档翻译深度研究API 文档MCP 服务
关于我们
关于 Suppr公司介绍联系我们用户协议隐私条款
关注我们

Suppr 超能文献

核心技术专利:CN118964589B侵权必究
粤ICP备2023148730 号-1Suppr @ 2026

文献检索

告别复杂PubMed语法,用中文像聊天一样搜索,搜遍4000万医学文献。AI智能推荐,让科研检索更轻松。

立即免费搜索

文件翻译

保留排版,准确专业,支持PDF/Word/PPT等文件格式,支持 12+语言互译。

免费翻译文档

深度研究

AI帮你快速写综述,25分钟生成高质量综述,智能提取关键信息,辅助科研写作。

立即免费体验

高效精确的基序发现

Efficient exact motif discovery.

作者信息

Marschall Tobias, Rahmann Sven

机构信息

Computer Science Department, Bioinformatics for High-Throughput Technologies at the Chair of Algorithm Engineering, TU Dortmund, Dortmund, Germany.

出版信息

Bioinformatics. 2009 Jun 15;25(12):i356-64. doi: 10.1093/bioinformatics/btp188.

DOI:10.1093/bioinformatics/btp188
PMID:19478010
原文链接:https://pmc.ncbi.nlm.nih.gov/articles/PMC2687942/
Abstract

MOTIVATION

The motif discovery problem consists of finding over-represented patterns in a collection of biosequences. It is one of the classical sequence analysis problems, but still has not been satisfactorily solved in an exact and efficient manner. This is partly due to the large number of possibilities of defining the motif search space and the notion of over-representation. Even for well-defined formalizations, the problem is frequently solved in an ad hoc manner with heuristics that do not guarantee to find the best motif.

RESULTS

We show how to solve the motif discovery problem (almost) exactly on a practically relevant space of IUPAC generalized string patterns, using the p-value with respect to an i.i.d. model or a Markov model as the measure of over-representation. In particular, (i) we use a highly accurate compound Poisson approximation for the null distribution of the number of motif occurrences. We show how to compute the exact clump size distribution using a recently introduced device called probabilistic arithmetic automaton (PAA). (ii) We define two p-value scores for over-representation, the first one based on the total number of motif occurrences, the second one based on the number of sequences in a collection with at least one occurrence. (iii) We describe an algorithm to discover the optimal pattern with respect to either of the scores. The method exploits monotonicity properties of the compound Poisson approximation and is by orders of magnitude faster than exhaustive enumeration of IUPAC strings (11.8 h compared with an extrapolated runtime of 4.8 years). (iv) We justify the use of the proposed scores for motif discovery by showing our method to outperform other motif discovery algorithms (e.g. MEME, Weeder) on benchmark datasets. We also propose new motifs on Mycobacterium tuberculosis.

AVAILABILITY AND IMPLEMENTATION

The method has been implemented in Java. It can be obtained from http://ls11-www.cs.tu-dortmund.de/people/marschal/paa_md/.

摘要

动机

基序发现问题包括在生物序列集合中寻找过度出现的模式。它是经典的序列分析问题之一,但仍未以精确且高效的方式得到令人满意的解决。部分原因在于定义基序搜索空间的可能性众多以及过度出现的概念。即使对于定义明确的形式化问题,该问题也常常通过不保证能找到最佳基序的启发式方法临时解决。

结果

我们展示了如何在IUPAC广义字符串模式的实际相关空间上(几乎)精确地解决基序发现问题,使用相对于独立同分布模型或马尔可夫模型的p值作为过度出现的度量。具体而言,(i)我们对基序出现次数的零分布使用高度精确的复合泊松近似。我们展示了如何使用最近引入的称为概率算术自动机(PAA)的工具来计算精确的聚集大小分布。(ii)我们为过度出现定义了两个p值分数,第一个基于基序出现的总数,第二个基于集合中至少出现一次的序列数量。(iii)我们描述了一种算法,用于发现相对于这两个分数中任何一个的最优模式。该方法利用了复合泊松近似的单调性属性,并且比IUPAC字符串的穷举枚举快几个数量级(11.8小时与外推运行时间4.8年相比)。(iv)通过在基准数据集上展示我们的方法优于其他基序发现算法(例如MEME、Weeder),我们证明了所提出的分数用于基序发现的合理性。我们还在结核分枝杆菌上提出了新的基序。

可用性和实现

该方法已用Java实现。可从http://ls11 - www.cs.tu - dortmund.de/people/marschal/paa_md/获取。

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5e29/2687942/47dde7d1c85c/btp188f1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5e29/2687942/47dde7d1c85c/btp188f1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5e29/2687942/47dde7d1c85c/btp188f1.jpg

相似文献

1
Efficient exact motif discovery.高效精确的基序发现
Bioinformatics. 2009 Jun 15;25(12):i356-64. doi: 10.1093/bioinformatics/btp188.
2
Conservative extraction of over-represented extensible motifs.过度呈现的可扩展基序的保守提取。
Bioinformatics. 2005 Jun;21 Suppl 1:i9-18. doi: 10.1093/bioinformatics/bti1051.
3
Faster exact Markovian probability functions for motif occurrences: a DFA-only approach.用于基序出现的更快精确马尔可夫概率函数:一种仅基于确定有限自动机的方法。
Bioinformatics. 2008 Dec 15;24(24):2839-48. doi: 10.1093/bioinformatics/btn525. Epub 2008 Oct 9.
4
Efficient sequential and parallel algorithms for finding edit distance based motifs.用于查找基于编辑距离的基序的高效顺序和并行算法。
BMC Genomics. 2016 Aug 18;17 Suppl 4(Suppl 4):465. doi: 10.1186/s12864-016-2789-9.
5
Efficient representation and P-value computation for high-order Markov motifs.高阶马尔可夫基序的高效表示与P值计算
Bioinformatics. 2008 Aug 15;24(16):i160-6. doi: 10.1093/bioinformatics/btn282.
6
An efficient method for significant motifs discovery from multiple DNA sequences.一种从多个DNA序列中发现显著基序的有效方法。
J Bioinform Comput Biol. 2017 Aug;15(4):1750014. doi: 10.1142/S0219720017500147. Epub 2017 May 11.
7
Estimation and efficient computation of the true probability of recurrence of short linear protein sequence motifs in unrelated proteins.估计和有效计算短线性蛋白质序列基序在无关蛋白质中的真实重现概率。
BMC Bioinformatics. 2010 Jan 7;11:14. doi: 10.1186/1471-2105-11-14.
8
Fast and accurate discovery of degenerate linear motifs in protein sequences.在蛋白质序列中快速准确地发现简并线性基序
PLoS One. 2014 Sep 10;9(9):e106081. doi: 10.1371/journal.pone.0106081. eCollection 2014.
9
A hybrid method for the exact planted (l, d) motif finding problem and its parallelization.用于精确种植 (l, d) 模式问题的混合方法及其并行化。
BMC Bioinformatics. 2012;13 Suppl 17(Suppl 17):S10. doi: 10.1186/1471-2105-13-S17-S10. Epub 2012 Dec 13.
10
Fast and exact quantification of motif occurrences in biological sequences.快速准确地定量生物序列中的基序出现次数。
BMC Bioinformatics. 2021 Sep 18;22(1):445. doi: 10.1186/s12859-021-04355-6.

引用本文的文献

1
Proxi-RIMS-seq2 applied to native microbiomes uncovers hundreds of known and novel m5C methyltransferase specificities.应用于天然微生物群的Proxi-RIMS-seq2揭示了数百种已知和新型的m5C甲基转移酶特异性。
Nucleic Acids Res. 2025 Mar 20;53(6). doi: 10.1093/nar/gkaf226.
2
Proxi-RIMS-seq2 applied to native microbiomes uncovers hundreds of known and novel C methyltransferase specificities.应用于天然微生物群的Proxi-RIMS-seq2揭示了数百种已知和新型的C甲基转移酶特异性。
bioRxiv. 2024 Jul 17:2024.07.15.603628. doi: 10.1101/2024.07.15.603628.
3
A Survey of Archaeal Restriction-Modification Systems.

本文引用的文献

1
Seeder: discriminative seeding DNA motif discovery.播种器:鉴别性播种DNA基序发现
Bioinformatics. 2008 Oct 15;24(20):2303-7. doi: 10.1093/bioinformatics/btn444. Epub 2008 Aug 21.
2
Eukaryotic transcription factor binding sites--modeling and integrative search methods.真核转录因子结合位点——建模与综合搜索方法
Bioinformatics. 2008 Jun 1;24(11):1325-31. doi: 10.1093/bioinformatics/btn198. Epub 2008 Apr 21.
3
Exact p-value calculation for heterotypic clusters of regulatory motifs and its application in computational annotation of cis-regulatory modules.
古菌限制修饰系统综述。
Microorganisms. 2023 Sep 28;11(10):2424. doi: 10.3390/microorganisms11102424.
4
Fast and exact quantification of motif occurrences in biological sequences.快速准确地定量生物序列中的基序出现次数。
BMC Bioinformatics. 2021 Sep 18;22(1):445. doi: 10.1186/s12859-021-04355-6.
5
Rapid identification of methylase specificity (RIMS-seq) jointly identifies methylated motifs and generates shotgun sequencing of bacterial genomes.快速鉴定甲基化酶特异性(RIMS-seq)联合鉴定甲基化基序,并对细菌基因组进行鸟枪法测序。
Nucleic Acids Res. 2021 Nov 8;49(19):e113. doi: 10.1093/nar/gkab705.
6
DiNAMO: highly sensitive DNA motif discovery in high-throughput sequencing data.DiNAMO:高通量测序数据中高度敏感的 DNA 基序发现。
BMC Bioinformatics. 2018 Jun 11;19(1):223. doi: 10.1186/s12859-018-2215-1.
7
libFLASM: a software library for fixed-length approximate string matching.libFLASM:一个用于固定长度近似字符串匹配的软件库。
BMC Bioinformatics. 2016 Nov 10;17(1):454. doi: 10.1186/s12859-016-1320-2.
8
BLSSpeller: exhaustive comparative discovery of conserved cis-regulatory elements.BLSSpeller:保守顺式调控元件的详尽比较发现
Bioinformatics. 2015 Dec 1;31(23):3758-66. doi: 10.1093/bioinformatics/btv466. Epub 2015 Aug 8.
9
Integrative network-based approach identifies key genetic elements in breast invasive carcinoma.基于整合网络的方法识别乳腺浸润性癌中的关键遗传元件。
BMC Genomics. 2015;16 Suppl 5(Suppl 5):S2. doi: 10.1186/1471-2164-16-S5-S2. Epub 2015 May 26.
10
N6-adenosine methylation in MiRNAs.微小RNA中的N6-腺苷甲基化
PLoS One. 2015 Feb 27;10(2):e0118438. doi: 10.1371/journal.pone.0118438. eCollection 2015.
调控基序异型簇的确切p值计算及其在顺式调控模块计算注释中的应用。
Algorithms Mol Biol. 2007 Oct 10;2:13. doi: 10.1186/1748-7188-2-13.
4
Multiple pattern matching: a Markov chain approach.多重模式匹配:一种马尔可夫链方法。
J Math Biol. 2008 Jan;56(1-2):51-92. doi: 10.1007/s00285-007-0109-3. Epub 2007 Aug 1.
5
Improved benchmarks for computational motif discovery.用于计算基序发现的改进基准。
BMC Bioinformatics. 2007 Jun 8;8:193. doi: 10.1186/1471-2105-8-193.
6
MotifCut: regulatory motifs finding with maximum density subgraphs.MotifCut:通过最大密度子图寻找调控基序
Bioinformatics. 2006 Jul 15;22(14):e150-7. doi: 10.1093/bioinformatics/btl243.
7
Analysis of computational approaches for motif discovery.基序发现的计算方法分析
Algorithms Mol Biol. 2006 May 19;1:8. doi: 10.1186/1748-7188-1-8.
8
A survey of motif discovery methods in an integrated framework.在一个集成框架中对基序发现方法的调查。
Biol Direct. 2006 Apr 6;1:11. doi: 10.1186/1745-6150-1-11.
9
The discovery, positioning and verification of a set of transcription-associated motifs in vertebrates.脊椎动物中一组转录相关基序的发现、定位及验证
Genome Biol. 2005;6(12):R104. doi: 10.1186/gb-2005-6-12-r104. Epub 2005 Dec 2.
10
Assessing computational tools for the discovery of transcription factor binding sites.评估用于发现转录因子结合位点的计算工具。
Nat Biotechnol. 2005 Jan;23(1):137-44. doi: 10.1038/nbt1053.