Fast motif recognition via application of statistical thresholds.

Affiliation

David R. Cheriton School of Computer Science, University of Waterloo, Waterloo, Ontario, Canada.

Publication information

BMC Bioinformatics. 2010 Jan 18;11 Suppl 1(Suppl 1):S11. doi: 10.1186/1471-2105-11-S1-S11.

DOI: 10.1186/1471-2105-11-S1-S11
PMID: 20122182
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3009483/
Abstract

BACKGROUND

Improving the accuracy and efficiency of motif recognition is an important computational challenge with applications to detecting transcription factor binding sites in genomic data. Closely related to motif recognition is the CONSENSUS STRING decision problem, which asks, given a parameter d and a set S = {s1, ..., sn} of strings of length l, whether there exists a consensus string that has Hamming distance at most d from every string in S. A set of strings S is pairwise bounded if the Hamming distance between any pair of strings in S is at most 2d. It is trivial to determine whether a set is pairwise bounded, and a set cannot have a consensus string unless it is pairwise bounded. We use CONSENSUS STRING to determine whether or not a pairwise bounded set has a consensus. Unfortunately, CONSENSUS STRING is NP-complete. The lack of an efficient method for solving CONSENSUS STRING has made it a computational bottleneck in MCL-WMR, a motif recognition program capable of solving difficult motif recognition problem instances.
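
The definitions above can be made concrete with a small sketch. It is illustrative only and is not the method used by MCL-WMR or sMCL-WMR: the function names, the DNA alphabet default, and the exhaustive search are assumptions chosen for clarity. The pairwise-bounded test is a cheap necessary condition (by the triangle inequality, two strings that are each within d of a common consensus are within 2d of each other), while the exhaustive consensus search takes time exponential in the string length l, which reflects why the problem becomes a bottleneck.

```python
from itertools import combinations, product


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def is_pairwise_bounded(strings, d):
    """True if every pair of strings is within Hamming distance 2d."""
    return all(hamming(a, b) <= 2 * d for a, b in combinations(strings, 2))


def consensus_string(strings, d, alphabet="ACGT"):
    """Exhaustive search for a consensus string (hypothetical helper).

    Assumes a non-empty list of equal-length strings. Returns the first
    candidate within distance d of every input string, or None if no
    such candidate exists. Runs in O(|alphabet|^l * n * l) time.
    """
    # Cheap necessary condition: skip the search if it cannot succeed.
    if not is_pairwise_bounded(strings, d):
        return None
    length = len(strings[0])
    for candidate in product(alphabet, repeat=length):
        c = "".join(candidate)
        if all(hamming(c, s) <= d for s in strings):
            return c
    return None
```

For example, with d = 1 the set {"AA", "CC", "GG"} is pairwise bounded, yet `consensus_string` returns `None`: a candidate of length 2 can agree with at most two of the three strings in some position. Pairwise boundedness is therefore necessary but not sufficient for a consensus to exist.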

RESULTS

We focus on the development of a method for solving CONSENSUS STRING quickly with a small probability of error. We apply this heuristic to develop a new motif recognition program, sMCL-WMR, which has impressive accuracy and efficiency. We demonstrate the performance of sMCL-WMR in detecting weak motifs in large data sets and in real genomic data sets, and compare the performance to other leading motif recognition programs. In our preliminary discussion of our CONSENSUS STRING algorithm we give insight into the issue of sampling pairwise bounded sets, and discuss its relevance to motif recognition.

CONCLUSION

Our novel heuristic yields a state-of-the-art program, sMCL-WMR, that is capable of detecting weak motifs in data sets with a large number of strings. sMCL-WMR is orders of magnitude faster than its predecessor MCL-WMR and is capable of solving previously unsolved synthetic motif recognition problems. Lastly, sMCL-WMR shows impressive accuracy in detecting transcription factor binding sites in the genomic data used in the assessment of Tompa et al.

Similar articles

1
Fast motif recognition via application of statistical thresholds.
BMC Bioinformatics. 2010 Jan 18;11 Suppl 1(Suppl 1):S11. doi: 10.1186/1471-2105-11-S1-S11.
2
Fast exact algorithms for the closest string and substring problems with application to the planted (L, d)-motif model.
IEEE/ACM Trans Comput Biol Bioinform. 2011 Sep-Oct;8(5):1400-10. doi: 10.1109/TCBB.2011.21.
3
Closest string with outliers.
BMC Bioinformatics. 2011 Feb 15;12 Suppl 1(Suppl 1):S55. doi: 10.1186/1471-2105-12-S1-S55.
4
Efficient sequential and parallel algorithms for finding edit distance based motifs.
BMC Genomics. 2016 Aug 18;17 Suppl 4(Suppl 4):465. doi: 10.1186/s12864-016-2789-9.
5
A fast weak motif-finding algorithm based on community detection in graphs.
BMC Bioinformatics. 2013 Jul 17;14:227. doi: 10.1186/1471-2105-14-227.
6
Efficient motif finding algorithms for large-alphabet inputs.
BMC Bioinformatics. 2010 Oct 26;11 Suppl 8(Suppl 8):S1. doi: 10.1186/1471-2105-11-S8-S1.
7
Prediction of cis-regulatory elements: from high-information content analysis to motif identification.
J Bioinform Comput Biol. 2007 Aug;5(4):817-38. doi: 10.1142/s021972000700293x.
8
DNA motif representation with nucleotide dependency.
IEEE/ACM Trans Comput Biol Bioinform. 2008 Jan-Mar;5(1):110-9. doi: 10.1109/TCBB.2007.70220.
9
PMS6: a fast algorithm for motif discovery.
Int J Bioinform Res Appl. 2014;10(4-5):369-83. doi: 10.1504/IJBRA.2014.062990.
10
On the hardness of counting and sampling center strings.
IEEE/ACM Trans Comput Biol Bioinform. 2012 Nov-Dec;9(6):1843-6. doi: 10.1109/TCBB.2012.84.

Cited by

1
SamSelect: a sample sequence selection algorithm for quorum planted motif search on large DNA datasets.
BMC Bioinformatics. 2018 Jun 18;19(1):228. doi: 10.1186/s12859-018-2242-y.
2
A Comparative Analysis Between k-Mers and Community Detection-Based Features for the Task of Protein Classification.
IEEE Trans Nanobioscience. 2016 Mar;15(2):84-92. doi: 10.1109/TNB.2016.2523501. Epub 2016 Feb 3.
A fast weak motif-finding algorithm based on community detection in graphs.
BMC Bioinformatics. 2013 Jul 17;14:227. doi: 10.1186/1471-2105-14-227.
4
PairMotif+: a fast and effective algorithm for de novo motif discovery in DNA sequences.
Int J Biol Sci. 2013 Apr 29;9(4):412-24. doi: 10.7150/ijbs.5786. Print 2013.

References

1
Fast and practical algorithms for planted (l, d) motif search.
IEEE/ACM Trans Comput Biol Bioinform. 2007 Oct-Dec;4(4):544-52. doi: 10.1109/TCBB.2007.70241.
2
Exact algorithms for planted motif problems.
J Comput Biol. 2005 Oct;12(8):1117-28. doi: 10.1089/cmb.2005.12.1117.
3
Identification of transcription factor binding sites with variable-order Bayesian networks.
Bioinformatics. 2005 Jun 1;21(11):2657-66. doi: 10.1093/bioinformatics/bti410. Epub 2005 Mar 29.
4
Assessing computational tools for the discovery of transcription factor binding sites.
Nat Biotechnol. 2005 Jan;23(1):137-44. doi: 10.1038/nbt1053.
5
Weeder Web: discovery of transcription factor binding sites in a set of sequences from co-regulated genes.
Nucleic Acids Res. 2004 Jul 1;32(Web Server issue):W199-203. doi: 10.1093/nar/gkh465.
6
Finding functional sequence elements by multiple local alignment.
Nucleic Acids Res. 2004 Jan 2;32(1):189-200. doi: 10.1093/nar/gkh169. Print 2004.
7
Finding composite regulatory patterns in DNA sequences.
Bioinformatics. 2002;18 Suppl 1:S354-63. doi: 10.1093/bioinformatics/18.suppl_1.s354.
8
Finding motifs using random projections.
J Comput Biol. 2002;9(2):225-42. doi: 10.1089/10665270252935430.
9
Combinatorial approaches to finding subtle signals in DNA sequences.
Proc Int Conf Intell Syst Mol Biol. 2000;8:269-78.
10
TRANSFAC: a database on transcription factors and their DNA binding sites.
Nucleic Acids Res. 1996 Jan 1;24(1):238-41. doi: 10.1093/nar/24.1.238.