使用评分矩阵对序列功能进行快速概率分析。

Fast probabilistic analysis of sequence function using scoring matrices.

作者信息

Wu T D, Nevill-Manning C G, Brutlag D L

机构信息

Department of Biochemistry, Stanford University School of Medicine, Stanford, CA, USA.

出版信息

Bioinformatics. 2000 Mar;16(3):233-44. doi: 10.1093/bioinformatics/16.3.233.

DOI:10.1093/bioinformatics/16.3.233

PMID:10869016

Abstract

MOTIVATION

We present techniques for increasing the speed of sequence analysis using scoring matrices. Our techniques are based on calculating, for a given scoring matrix, the quantile function, which assigns a probability, or p, value to each segmental score. Our techniques also permit the user to specify a p threshold to indicate the desired trade-off between sensitivity and speed for a particular sequence analysis. The resulting increase in speed should allow scoring matrices to be used more widely in large-scale sequencing and annotation projects.

RESULTS

We develop three techniques for increasing the speed of sequence analysis: probability filtering, lookahead scoring, and permuted lookahead scoring. In probability filtering, we compute the score threshold that corresponds to the user-specified p threshold. We use the score threshold to limit the number of segments that are retained in the search process. In lookahead scoring, we test intermediate scores to determine whether they will possibly exceed the score threshold. In permuted lookahead scoring, we score each segment in a particular order designed to maximize the likelihood of early termination. Our two lookahead scoring techniques reduce substantially the number of residues that must be examined. The fraction of residues examined ranges from 62 to 6%, depending on the p threshold chosen by the user. These techniques permit sequence analysis with scoring matrices at speeds that are several times faster than existing programs. On a database of 12 177 alignment blocks, our techniques permit sequence analysis at a speed of 225 residues/s for a p threshold of 10-6, and 541 residues/s for a p threshold of 10-20. In order to compute the quantile function, we may use either an independence assumption or a Markov assumption. We measure the effect of first- and second-order Markov assumptions and find that they tend to raise the p value of segments, when compared with the independence assumption, by average ratios of 1.30 and 1.69, respectively. We also compare our technique with the empirical 99. 5th percentile scores compiled in the BLOCKSPLUS database, and find that they correspond on average to a p value of 1.5 x 10-5.

AVAILABILITY

The techniques described above are implemented in a software package called EMATRIX. This package is available from the authors for free academic use or for licensed commercial use. The EMATRIX set of programs is also available on the Internet at http://motif.stanford.edu/ematrix.

摘要

动机

我们提出了使用评分矩阵提高序列分析速度的技术。我们的技术基于为给定的评分矩阵计算分位数函数，该函数为每个片段得分赋予一个概率或p值。我们的技术还允许用户指定一个p阈值，以表明在特定序列分析中灵敏度和速度之间所需的权衡。速度的提高应使评分矩阵在大规模测序和注释项目中得到更广泛的应用。

结果

我们开发了三种提高序列分析速度的技术：概率过滤、前瞻评分和置换前瞻评分。在概率过滤中，我们计算与用户指定的p阈值相对应的得分阈值。我们使用得分阈值来限制在搜索过程中保留的片段数量。在前瞻评分中，我们测试中间得分以确定它们是否可能超过得分阈值。在置换前瞻评分中，我们以特定顺序对每个片段进行评分，以最大限度地提高提前终止的可能性。我们的两种前瞻评分技术大大减少了必须检查的残基数量。根据用户选择的p阈值，检查的残基比例范围从62%到6%。这些技术允许使用评分矩阵进行序列分析，速度比现有程序快几倍。在一个包含12177个比对块的数据库上，对于p阈值为10^-6，我们的技术允许以225个残基/秒的速度进行序列分析，对于p阈值为10^-20，则为541个残基/秒。为了计算分位数函数，我们可以使用独立性假设或马尔可夫假设。我们测量了一阶和二阶马尔可夫假设的影响，发现与独立性假设相比，它们往往会使片段的p值分别平均提高1.30和1.69倍。我们还将我们的技术与BLOCKSPLUS数据库中汇编的经验第99.5百分位数得分进行了比较，发现它们平均对应于1.5×10^-5的p值。