Wu T D, Nevill-Manning C G, Brutlag D L
Department of Biochemistry, Stanford University School of Medicine, Stanford, CA, USA.
Bioinformatics. 2000 Mar;16(3):233-44. doi: 10.1093/bioinformatics/16.3.233.
We present techniques for increasing the speed of sequence analysis using scoring matrices. Our techniques are based on calculating, for a given scoring matrix, the quantile function, which assigns a probability, or p, value to each segmental score. Our techniques also permit the user to specify a p threshold to indicate the desired trade-off between sensitivity and speed for a particular sequence analysis. The resulting increase in speed should allow scoring matrices to be used more widely in large-scale sequencing and annotation projects.
We develop three techniques for increasing the speed of sequence analysis: probability filtering, lookahead scoring, and permuted lookahead scoring. In probability filtering, we compute the score threshold that corresponds to the user-specified p threshold. We use the score threshold to limit the number of segments that are retained in the search process. In lookahead scoring, we test intermediate scores to determine whether they will possibly exceed the score threshold. In permuted lookahead scoring, we score each segment in a particular order designed to maximize the likelihood of early termination. Our two lookahead scoring techniques reduce substantially the number of residues that must be examined. The fraction of residues examined ranges from 62 to 6%, depending on the p threshold chosen by the user. These techniques permit sequence analysis with scoring matrices at speeds that are several times faster than existing programs. On a database of 12 177 alignment blocks, our techniques permit sequence analysis at a speed of 225 residues/s for a p threshold of 10-6, and 541 residues/s for a p threshold of 10-20. In order to compute the quantile function, we may use either an independence assumption or a Markov assumption. We measure the effect of first- and second-order Markov assumptions and find that they tend to raise the p value of segments, when compared with the independence assumption, by average ratios of 1.30 and 1.69, respectively. We also compare our technique with the empirical 99. 5th percentile scores compiled in the BLOCKSPLUS database, and find that they correspond on average to a p value of 1.5 x 10-5.
The techniques described above are implemented in a software package called EMATRIX. This package is available from the authors for free academic use or for licensed commercial use. The EMATRIX set of programs is also available on the Internet at http://motif.stanford.edu/ematrix.
我们提出了使用评分矩阵提高序列分析速度的技术。我们的技术基于为给定的评分矩阵计算分位数函数,该函数为每个片段得分赋予一个概率或p值。我们的技术还允许用户指定一个p阈值,以表明在特定序列分析中灵敏度和速度之间所需的权衡。速度的提高应使评分矩阵在大规模测序和注释项目中得到更广泛的应用。
我们开发了三种提高序列分析速度的技术:概率过滤、前瞻评分和置换前瞻评分。在概率过滤中,我们计算与用户指定的p阈值相对应的得分阈值。我们使用得分阈值来限制在搜索过程中保留的片段数量。在前瞻评分中,我们测试中间得分以确定它们是否可能超过得分阈值。在置换前瞻评分中,我们以特定顺序对每个片段进行评分,以最大限度地提高提前终止的可能性。我们的两种前瞻评分技术大大减少了必须检查的残基数量。根据用户选择的p阈值,检查的残基比例范围从62%到6%。这些技术允许使用评分矩阵进行序列分析,速度比现有程序快几倍。在一个包含12177个比对块的数据库上,对于p阈值为10^-6,我们的技术允许以225个残基/秒的速度进行序列分析,对于p阈值为10^-20,则为541个残基/秒。为了计算分位数函数,我们可以使用独立性假设或马尔可夫假设。我们测量了一阶和二阶马尔可夫假设的影响,发现与独立性假设相比,它们往往会使片段的p值分别平均提高1.30和1.69倍。我们还将我们的技术与BLOCKSPLUS数据库中汇编的经验第99.5百分位数得分进行了比较,发现它们平均对应于1.5×10^-5的p值。
上述技术在一个名为EMATRIX的软件包中实现。该软件包可从作者处获得,供学术免费使用或商业许可使用。EMATRIX程序集也可在互联网上获取,网址为http://motif.stanford.edu/ematrix。