Accurate statistics for local sequence alignment with position-dependent scoring by rare-event sampling.

Affiliation

Laboratoire MAP5 (UMR CNRS 8145), Université Paris Descartes, Paris, France.

Publication information

BMC Bioinformatics. 2011 Feb 3;12:47. doi: 10.1186/1471-2105-12-47.

DOI: 10.1186/1471-2105-12-47
PMID: 21291566
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3042914/

Abstract

BACKGROUND

Molecular database search tools need statistical models to assess the significance of the resulting hits. In the classical approach one asks how probable it is that a certain score is observed by pure chance. Asymptotic theories for such questions are available for two random i.i.d. sequences. Some effort has been made to include the effects of finite sequence lengths and to account for specific compositions of the sequences. In many applications, such as a large-scale database homology search for transmembrane proteins, these models are not the most appropriate ones. Search sensitivity and specificity benefit from position-dependent scoring schemes or the use of hidden Markov models. Additionally, one may wish to go beyond the assumption that the sequences are i.i.d. Despite their practical importance, the statistical properties of these settings have not yet been well investigated.
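For orientation, the classical asymptotic result for two random i.i.d. sequences is commonly written in the Karlin-Altschul (Gumbel) form sketched below. This is a reminder of the standard textbook formula for ungapped local alignment, not a formula quoted from this abstract. The symbols are assumptions introduced here: $m$ and $n$ are the two sequence lengths, $S$ is the optimal local alignment score, and $\lambda$ and $K$ are constants that depend on the scoring matrix and the letter frequencies.

```latex
% Expected number of chance alignments with score at least s (the E-value)
E(s) \approx K\, m\, n\, e^{-\lambda s}

% Probability that the optimal score reaches s by pure chance (Gumbel tail)
P(S \ge s) \approx 1 - \exp\!\left(-K\, m\, n\, e^{-\lambda s}\right)
```

For large $s$ the second expression reduces to $P(S \ge s) \approx E(s)$, which is why the far tail of the score distribution determines the significance of strong hits.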

RESULTS

In this paper, we discuss an efficient and general method to compute the score distribution to any desired accuracy. The general approach may be applied to different sequence models and to various similarity measures that satisfy a few weak assumptions. The method gives access to the low-probability region ("tail") of the distribution, where scores are larger than expected by pure chance and are therefore relevant for practical applications. Our method uses recent ideas from rare-event simulations, combining Markov chain Monte Carlo simulations with importance sampling and generalized ensembles. We present results for the score statistics of fixed and random queries against random sequences. In a second step, we extend the approach to a model of transmembrane proteins, which can hardly be described as i.i.d. sequences. For this case, we compare the statistical properties of a fixed query model as well as a hidden Markov sequence model, in connection with a position-based scoring scheme, against the classical approach.
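The idea of combining Markov chain Monte Carlo with importance sampling can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a fixed query, a uniform i.i.d. subject sequence over a four-letter alphabet, a toy match/mismatch scoring with a linear gap penalty, and a single bias "temperature" in place of the generalized ensembles mentioned in the abstract. All function names and parameter values are hypothetical.

```python
import math
import random
from collections import Counter

ALPHABET = "ACGT"  # toy alphabet (assumption); proteins would use 20 letters

def local_score(a, b, match=1, mismatch=-1, gap=-2):
    """Smith-Waterman local alignment score with a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            cell = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            cur.append(cell)
            best = max(best, cell)
        prev = cur
    return best

def sample_biased(query, length, temperature, steps, rng):
    """Metropolis chain over subject sequences with weight exp(S / temperature).

    The proposal redraws one letter uniformly, which is symmetric and leaves
    the uniform i.i.d. null model invariant, so only the score bias enters
    the acceptance probability.
    """
    seq = [rng.choice(ALPHABET) for _ in range(length)]
    score = local_score(query, seq)
    hist = Counter()
    for _ in range(steps):
        pos = rng.randrange(length)
        old = seq[pos]
        seq[pos] = rng.choice(ALPHABET)
        new_score = local_score(query, seq)
        if rng.random() < math.exp(min(0.0, (new_score - score) / temperature)):
            score = new_score   # accept the move
        else:
            seq[pos] = old      # reject and restore the letter
        hist[score] += 1
    return hist

def reweight(hist, temperature):
    """Undo the bias: p(s) is proportional to H(s) * exp(-s / temperature)."""
    smin = min(hist)
    w = {s: h * math.exp(-(s - smin) / temperature) for s, h in hist.items()}
    total = sum(w.values())
    return {s: v / total for s, v in sorted(w.items())}

if __name__ == "__main__":
    rng = random.Random(1)
    hist = sample_biased("ACGTACGTAC", 40, temperature=1.5, steps=20000, rng=rng)
    for s, p in reweight(hist, 1.5).items():
        print(s, p)
```

A small positive temperature pushes the chain toward high-scoring sequences that simple sampling would almost never produce, and the reweighting step maps the biased histogram back to the unbiased null distribution. The final normalization is only trustworthy if the chain also visits the scores that carry most of the unbiased probability mass. In practice one would combine several temperatures or a generalized ensemble, and discard an initial equilibration phase.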

CONCLUSIONS

The results illustrate that the sensitivity and specificity strongly depend on the underlying scoring and sequence model. A specific ROC analysis for the case of transmembrane proteins supports our observation.
