Efficient and accurate P-value computation for Position Weight Matrices.

Authors

Touzet Hélène, Varré Jean-Stéphane

Affiliation

LIFL, UMR CNRS 8022, Université des Sciences et Technologies de Lille, 59655 Villeneuve d'Ascq, France.

Publication

Algorithms Mol Biol. 2007 Dec 11;2:15. doi: 10.1186/1748-7188-2-15.

DOI:10.1186/1748-7188-2-15
PMID:18072973
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC2238751/
Abstract

BACKGROUND

Position Weight Matrices (PWMs) are probabilistic representations of signals in sequences. They are widely used to model approximate patterns in DNA or protein sequences. Using a PWM requires knowing the statistical significance of a word given its score. This is captured by the P-value of a score: the probability that the background model achieves a score greater than or equal to the observed value. This gives rise to the following problem: given a P-value, find the corresponding score threshold. Existing methods rely on dynamic programming or probability generating functions. For many PWMs, they fail to give accurate results in a reasonable amount of time.
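To make the two quantities above concrete, here is a minimal Python sketch, assuming an independent, identically distributed background and a PWM stored as one dictionary of letter scores per column. It enumerates every word, so it only works for toy sizes. The function names, the toy matrix, and the convention used for the threshold (the lowest achievable score whose P-value does not exceed the target) are assumptions made for illustration and are not taken from the paper.

```python
from itertools import product

def score_distribution(pwm, background):
    """Map each achievable score to its probability under an i.i.d. background.

    Brute force over all |alphabet| ** len(pwm) words: toy sizes only.
    """
    dist = {}
    for word in product(background, repeat=len(pwm)):
        s = sum(col[ch] for col, ch in zip(pwm, word))
        prob = 1.0
        for ch in word:
            prob *= background[ch]
        dist[s] = dist.get(s, 0.0) + prob
    return dist

def p_value(pwm, background, threshold):
    """P(score >= threshold) for a random word drawn from the background."""
    dist = score_distribution(pwm, background)
    return sum(p for s, p in dist.items() if s >= threshold)

def score_threshold(pwm, background, target):
    """Lowest achievable score whose P-value is still <= target (None if none)."""
    dist = score_distribution(pwm, background)
    best, tail = None, 0.0
    for s in sorted(dist, reverse=True):
        tail += dist[s]
        if tail > target:
            break
        best = s
    return best

# Hypothetical 2-column matrix with integer entries (illustration only).
toy_pwm = [
    {"A": 2, "C": -1, "G": -1, "T": 0},
    {"A": -1, "C": 3, "G": 0, "T": -1},
]
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

print(p_value(toy_pwm, uniform, 3))            # 0.125: only AC (5) and TC (3) reach 3
print(score_threshold(toy_pwm, uniform, 0.2))  # 3
```

The enumeration costs time exponential in the matrix length, which is why practical methods work on the score distribution rather than on individual words.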

RESULTS

The contribution of this paper is twofold. First, we study the theoretical complexity of the problem and prove that it is NP-hard. Then, we describe a novel algorithm that solves the P-value problem efficiently. The main idea is to use a series of discretized score distributions that improve the final result step by step until a convergence criterion is met. Moreover, the algorithm is capable of calculating the exact P-value without any error, even for matrices with non-integer coefficient values. The same approach is also used to devise an accurate algorithm for the reverse problem: finding the P-value for a given score. Both methods are implemented in a software package called TFM-PVALUE, which is freely available.
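The idea of refining a discretized score distribution step by step can be illustrated with a short Python sketch. This is not the TFM-PVALUE implementation: the floor rounding, the bracketing rule, the factor-of-ten refinement, and the stopping rule are all assumptions chosen to keep the example small. Each matrix entry is rounded down to a multiple of a granularity `eps`, so for a matrix with m columns the rounded score never exceeds the true score and falls short of it by less than m * eps. That gives a lower and an upper bound on the P-value, and `eps` is shrunk until the two bounds meet.

```python
from collections import defaultdict
from fractions import Fraction
from math import ceil, floor

def rounded_distribution(pwm, background, eps):
    """Score distribution after flooring every entry to a multiple of eps.

    Returns {k: probability}, where k * eps is the rounded score.
    Built column by column (dynamic programming over integer scores).
    """
    dist = {0: 1.0}
    for col in pwm:
        nxt = defaultdict(float)
        for k, p in dist.items():
            for ch, s in col.items():
                nxt[k + floor(Fraction(s) / eps)] += p * background[ch]
        dist = dict(nxt)
    return dist

def p_value_bounds(pwm, background, threshold, eps):
    """Bracket P(score >= threshold) using the matrix rounded at granularity eps."""
    dist = rounded_distribution(pwm, background, eps)
    t = Fraction(threshold) / eps
    k_sure = ceil(t)                    # rounded >= threshold implies true >= threshold
    k_maybe = floor(t) - len(pwm) + 1   # true >= threshold implies rounded > threshold - m * eps
    lower = sum(p for k, p in dist.items() if k >= k_sure)
    upper = sum(p for k, p in dist.items() if k >= k_maybe)
    return lower, upper

def refine_p_value(pwm, background, threshold, tol=1e-9, max_steps=8):
    """Shrink eps tenfold until the bounds agree to within tol (or steps run out)."""
    eps = Fraction(1)
    for _ in range(max_steps):
        lower, upper = p_value_bounds(pwm, background, threshold, eps)
        if upper - lower <= tol:
            break
        eps /= 10
    return lower, upper, eps

# Hypothetical matrix with non-integer entries (illustration only).
frac_pwm = [
    {"A": 1.2, "C": -0.7, "G": -0.7, "T": 0.1},
    {"A": -0.5, "C": 1.9, "G": 0.3, "T": -0.5},
]
uniform_bg = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

print(p_value_bounds(frac_pwm, uniform_bg, 1.0, Fraction(1)))  # (0.1875, 0.5)
print(refine_p_value(frac_pwm, uniform_bg, 1.0))               # (0.3125, 0.3125, Fraction(1, 10))
```

In this toy case a single refinement, from granularity 1 to 1/10, is enough for the bounds to coincide, because no word scores close enough to the threshold for the rounding to matter. Scores and granularities are handled as `Fraction` values so that the rounding itself introduces no floating-point error.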

CONCLUSION

We have tested TFM-PVALUE on a large set of PWMs representing transcription factor binding sites. Experimental results show that it achieves better performance in terms of computational time and precision than existing tools.
