Fast probabilistic analysis of sequence function using scoring matrices.

Authors

Wu T D, Nevill-Manning C G, Brutlag D L

Affiliation

Department of Biochemistry, Stanford University School of Medicine, Stanford, CA, USA.

Publication information

Bioinformatics. 2000 Mar;16(3):233-44. doi: 10.1093/bioinformatics/16.3.233.

DOI: 10.1093/bioinformatics/16.3.233
PMID: 10869016

Abstract

MOTIVATION

We present techniques for increasing the speed of sequence analysis using scoring matrices. Our techniques are based on calculating, for a given scoring matrix, the quantile function, which assigns a probability, or p, value to each segmental score. Our techniques also permit the user to specify a p threshold to indicate the desired trade-off between sensitivity and speed for a particular sequence analysis. The resulting increase in speed should allow scoring matrices to be used more widely in large-scale sequencing and annotation projects.
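The idea of assigning a p value to each segmental score can be illustrated with a small sketch. It assumes integer scores, independent positions, and a fixed background residue distribution; the function names (`score_distribution`, `p_value`, `score_threshold`) and the toy matrix are hypothetical and are not taken from EMATRIX.

```python
from collections import defaultdict

def score_distribution(matrix, background):
    """Exact distribution of segment scores under an independence assumption.

    matrix: list of dicts, one per position, mapping residue -> integer score.
    background: dict mapping residue -> probability (assumed to sum to 1).
    Returns a dict mapping each attainable total score to its probability.
    """
    dist = {0: 1.0}
    for column in matrix:
        nxt = defaultdict(float)
        # Convolve the running distribution with this column's score distribution.
        for partial, prob in dist.items():
            for residue, s in column.items():
                nxt[partial + s] += prob * background[residue]
        dist = dict(nxt)
    return dist

def p_value(dist, score):
    """Probability that a random segment scores at least `score`."""
    return sum(p for s, p in dist.items() if s >= score)

def score_threshold(dist, p_threshold):
    """Lowest score whose p value is <= p_threshold, or None if no score qualifies."""
    tail = 0.0
    best = None
    for s in sorted(dist, reverse=True):
        tail += dist[s]
        if tail > p_threshold:
            break
        best = s
    return best

# Toy two-letter alphabet and two-column matrix, for illustration only.
toy_background = {"A": 0.5, "B": 0.5}
toy_matrix = [{"A": 2, "B": 0}, {"A": 1, "B": -1}]
toy_dist = score_distribution(toy_matrix, toy_background)
```

With the threshold computed once per matrix, a search only needs to compare each segment score against a single number, which is what makes a user-chosen p threshold cheap to apply.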

RESULTS

We develop three techniques for increasing the speed of sequence analysis: probability filtering, lookahead scoring, and permuted lookahead scoring. In probability filtering, we compute the score threshold that corresponds to the user-specified p threshold. We use the score threshold to limit the number of segments that are retained in the search process. In lookahead scoring, we test intermediate scores to determine whether they will possibly exceed the score threshold. In permuted lookahead scoring, we score each segment in a particular order designed to maximize the likelihood of early termination. Our two lookahead scoring techniques substantially reduce the number of residues that must be examined. The fraction of residues examined ranges from 62% to 6%, depending on the p threshold chosen by the user. These techniques permit sequence analysis with scoring matrices at speeds that are several times faster than existing programs. On a database of 12,177 alignment blocks, our techniques permit sequence analysis at a speed of 225 residues/s for a p threshold of 10^-6, and 541 residues/s for a p threshold of 10^-20. In order to compute the quantile function, we may use either an independence assumption or a Markov assumption. We measure the effect of first- and second-order Markov assumptions and find that they tend to raise the p value of segments, when compared with the independence assumption, by average ratios of 1.30 and 1.69, respectively. We also compare our technique with the empirical 99.5th percentile scores compiled in the BLOCKSPLUS database, and find that they correspond on average to a p value of 1.5 × 10^-5.
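Lookahead scoring and its permuted variant can be sketched as follows. This is a minimal illustration, not the EMATRIX implementation: all function names are hypothetical, and because the abstract does not say how the scoring order is chosen, the ordering heuristic in `permuted_order` (columns with the largest gap between maximum and expected score first) is an assumption.

```python
def max_suffix_scores(matrix, order):
    """suffix[i] = best score still attainable from positions order[i:]."""
    suffix = [0] * (len(order) + 1)
    for i in range(len(order) - 1, -1, -1):
        suffix[i] = suffix[i + 1] + max(matrix[order[i]].values())
    return suffix

def lookahead_score(segment, matrix, threshold, order=None):
    """Score a segment, stopping as soon as the threshold is unreachable.

    Returns (score, residues_examined); score is None on early termination.
    A real search would precompute the suffix table once per matrix.
    """
    if order is None:
        order = list(range(len(matrix)))
    suffix = max_suffix_scores(matrix, order)
    total = 0
    for i, pos in enumerate(order):
        total += matrix[pos][segment[pos]]
        # Even with the best possible remaining residues, can we reach the threshold?
        if total + suffix[i + 1] < threshold:
            return None, i + 1
    return total, len(order)

def permuted_order(matrix, background):
    """Assumed heuristic: visit first the columns where a random residue is
    expected to fall furthest below the column maximum."""
    def expected_loss(column):
        expected = sum(background[r] * s for r, s in column.items())
        return max(column.values()) - expected
    return sorted(range(len(matrix)),
                  key=lambda j: expected_loss(matrix[j]), reverse=True)

# Toy data for illustration only.
demo_background = {"A": 0.5, "B": 0.5}
demo_matrix = [{"A": 1, "B": 0}, {"A": 5, "B": -5}, {"A": 2, "B": 1}]
demo_order = permuted_order(demo_matrix, demo_background)
```

On the toy data, the segment "ABA" with threshold 7 is rejected after two residues in natural order but after one residue in the permuted order, because the most discriminating column is examined first. Segments that do reach the threshold are fully scored in either order.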

AVAILABILITY

The techniques described above are implemented in a software package called EMATRIX. This package is available from the authors for free academic use or for licensed commercial use. The EMATRIX set of programs is also available on the Internet at http://motif.stanford.edu/ematrix.
