A BLAST from the past: revisiting blastp's E-value.

Author information

Lu Yang Young, Noble William Stafford, Keich Uri

Affiliations

Cheriton School of Computer Science, University of Waterloo, Waterloo, ON N2L 3G1, Canada.

Department of Genome Sciences and Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA 98105, United States.

Publication information

Bioinformatics. 2024 Nov 28;40(12). doi: 10.1093/bioinformatics/btae729.

Abstract

MOTIVATION

The Basic Local Alignment Search Tool, BLAST, is an indispensable tool for genomic research. BLAST has established itself as the canonical tool for sequence similarity search in large part thanks to its meaningful statistical analysis. Specifically, BLAST reports the E-value of each reported alignment, which is defined as the expected number of optimal local alignments that will score at least as high as the observed alignment score, assuming that the query and the database sequences are randomly generated.
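For orientation, the E-value described above is usually written in the Karlin-Altschul form shown below. The abstract itself gives no formula, so the symbols are assumptions introduced here: S is the observed alignment score, m and n are the query and database lengths, and K and \lambda are parameters that depend on the scoring system and the residue frequencies. Actual BLAST implementations apply further corrections to this basic form.

```latex
% Expected number of chance alignments scoring at least S
E = K \, m \, n \, e^{-\lambda S}

% Probability of at least one such alignment, treating the count as Poisson with mean E
P = 1 - e^{-E}
```

For small E the two quantities are nearly equal, since 1 - e^{-E} \approx E.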

RESULTS

Here, we critically evaluate the E-values provided by the standard protein BLAST (blastp), showing that they can be at times significantly conservative while at others too liberal. We offer an alternative approach based on generating a small sample from the null distribution of random optimal alignments, and testing whether the observed alignment score is consistent with it. In contrast with blastp, our significance analysis seems valid, in the sense that it did not deliver inflated significance estimates in any of our extensive experiments. Moreover, although our method is slightly conservative, it is often significantly less so than the blastp E-value. Indeed, in cases where blastp's analysis is valid (i.e., not too liberal), our approach seems to deliver a greater number of correct alignments. One advantage of our approach is that it works with any reasonable choice of substitution matrix and gap penalties, avoiding blastp's limited options of matrices and penalties. In addition, we can formulate the problem using a canonical family-wise error rate control setup, thereby dispensing with E-values, which can at times be difficult to interpret.
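The idea of comparing an observed score against a small sample from a null distribution can be illustrated with a generic Monte Carlo sketch. This is not the authors' SGPvalue implementation: the scoring scheme (simple match/mismatch values with a linear gap penalty), the null model (shuffling the subject sequence), and all function names are assumptions chosen for illustration only. The final helper shows a Bonferroni correction as one standard way to control the family-wise error rate.

```python
import random


def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Optimal local alignment score (toy scoring, linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def shuffled(seq, rng):
    """Return a random permutation of seq (assumed null model)."""
    chars = list(seq)
    rng.shuffle(chars)
    return "".join(chars)


def empirical_p_value(query, subject, n_null=200, seed=0):
    """Monte Carlo p-value of the observed score against shuffled subjects."""
    rng = random.Random(seed)
    observed = smith_waterman_score(query, subject)
    null_scores = [
        smith_waterman_score(query, shuffled(subject, rng))
        for _ in range(n_null)
    ]
    exceed = sum(s >= observed for s in null_scores)
    # The +1 terms keep the estimate from ever being exactly zero.
    return observed, (exceed + 1) / (n_null + 1)


def bonferroni_reject(p_values, alpha=0.05):
    """Flag hypotheses rejected at family-wise error rate alpha."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]
```

With this purely empirical estimate the smallest attainable p-value is 1/(n_null + 1), so a small null sample cannot by itself support very small p-values after a multiple-testing correction over a large database.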

AVAILABILITY AND IMPLEMENTATION

The Apache licensed source code is available at https://github.com/batmen-lab/SGPvalue.
