Empirical statistical estimates for sequence similarity searches.

Author information

Pearson W R

Affiliation

Department of Biochemistry, University of Virginia, Charlottesville 22908, USA.

Publication information

J Mol Biol. 1998 Feb 13;276(1):71-84. doi: 10.1006/jmbi.1997.1525.

Abstract

The FASTA package of sequence comparison programs has been modified to provide accurate statistical estimates for local sequence similarity scores with gaps. These estimates are derived using the extreme value distribution from the mean and variance of the local similarity scores of unrelated sequences after the scores have been corrected for the expected effect of library sequence length. This approach allows accurate estimates to be calculated for both FASTA and Smith-Waterman similarity scores for protein/protein, DNA/DNA, and protein/translated-DNA comparisons. The accuracy of the statistical estimates is summarized for 54 protein families using FASTA and Smith-Waterman scores. Probability estimates calculated from the distribution of similarity scores are generally conservative, as are probabilities calculated using the Altschul-Gish lambda, kappa, and eta parameters. The performance of several alternative methods for correcting similarity scores for library-sequence length was evaluated using 54 protein superfamilies from the PIR39 database and 110 protein families from the Prosite/SwissProt rel. 34 database. Both regression-scaled and Altschul-Gish scaled scores perform significantly better than unscaled Smith-Waterman or FASTA similarity scores. When the Prosite/SwissProt test set is used, regression-scaled scores perform slightly better; when the PIR database is used, Altschul-Gish scaled scores perform best. Thus, length-corrected similarity scores improve the sensitivity of database searches. Statistical parameters that are derived from the distribution of similarity scores from the thousands of unrelated sequences typically encountered in a database search provide accurate estimates of statistical significance that can be used to infer sequence homology.
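
The idea of regression-scaled scores followed by an extreme value estimate can be illustrated with a minimal sketch. It is not the FASTA package's code. It assumes that the length correction is a least-squares regression of score on the logarithm of library sequence length, that the residual standard deviation uses n - 2 degrees of freedom, and that z-scores follow an extreme value (Gumbel) distribution with mean 0 and variance 1. All function names are hypothetical.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def fit_length_regression(scores, lengths):
    """Least-squares fit of score = a + b * ln(length).

    Returns (a, b, residual_sd). Assumption: the scores come from
    unrelated library sequences, so the fit describes chance similarity.
    """
    xs = [math.log(n) for n in lengths]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(scores) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    b = sxy / sxx
    a = mean_y - b * mean_x
    # Residual variance with count - 2 degrees of freedom (two fitted parameters).
    ss_resid = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, scores))
    return a, b, math.sqrt(ss_resid / (count - 2))


def z_score(score, length, a, b, sd):
    """Length-corrected score: distance above the regression line in sd units."""
    return (score - (a + b * math.log(length))) / sd


def evd_p_value(z):
    """P(Z >= z) for a Gumbel distribution with mean 0 and variance 1."""
    # 1 - exp(-t), written with expm1 so tiny probabilities keep precision.
    return -math.expm1(-math.exp(-math.pi / math.sqrt(6) * z - EULER_GAMMA))


def expectation(z, library_size):
    """Expected number of library sequences reaching z by chance."""
    return library_size * evd_p_value(z)
```

In this sketch a search would fit the regression once on the scores of the (presumed unrelated) library sequences, then convert each score to a z-score and an expectation value. The constants in `evd_p_value` follow from fixing the Gumbel scale at sqrt(6)/pi and the location at minus Euler's constant times that scale, which gives zero mean and unit variance.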
