Testing statistical significance scores of sequence comparison methods with structure similarity.

Author information

Hulsen Tim, de Vlieg Jacob, Leunissen Jack A M, Groenen Peter M A

Affiliation

Centre for Molecular and Biomolecular Informatics, Nijmegen Centre for Molecular Life Sciences, Radboud University Nijmegen Medical Centre, Nijmegen, The Netherlands.

Publication information

BMC Bioinformatics. 2006 Oct 12;7:444. doi: 10.1186/1471-2105-7-444.

DOI:10.1186/1471-2105-7-444

PMID:17038163

Original link: https://pmc.ncbi.nlm.nih.gov/articles/PMC1618413/

Abstract

BACKGROUND

In recent years, the Smith-Waterman sequence comparison algorithm has gained popularity due to improved implementations and rapidly increasing computing power. However, the quality and sensitivity of a database search are determined not only by the algorithm but also by the statistical significance testing applied to an alignment. The e-value is the most commonly used statistical validation method for sequence database searching. The CluSTr database and the Protein World database were created using an alternative statistical significance test: a Z-score based on Monte-Carlo statistics. Several papers have described the superiority of the Z-score over the e-value, using simulated data. We wanted to know whether this could be validated on existing, evolutionarily related protein sequences.
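A Monte-Carlo Z-score of this kind is commonly obtained by comparing the observed alignment score S with the scores of alignments against shuffled versions of one sequence: Z = (S - mean) / standard deviation. The following is a minimal sketch under stated assumptions, not the procedure used by CluSTr, Protein World or the paper: it uses a toy scoring scheme (fixed match/mismatch values and a linear gap penalty) instead of a substitution matrix with affine gaps, and the names `sw_score` and `z_score`, the number of shuffles and the zero-variance convention are illustrative choices.

```python
import math
import random
import statistics


def sw_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score (Smith-Waterman with a linear gap penalty).

    Toy scoring parameters; real tools use substitution matrices and affine gaps.
    """
    prev = [0] * (len(b) + 1)  # previous row of the dynamic-programming matrix
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def z_score(a, b, n_shuffles=100, seed=0):
    """Z = (observed - mean of shuffled scores) / std of shuffled scores."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = sw_score(a, b)
    letters = list(b)
    shuffled_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)  # keeps composition, destroys residue order
        shuffled_scores.append(sw_score(a, "".join(letters)))
    mean = statistics.mean(shuffled_scores)
    sd = statistics.pstdev(shuffled_scores)
    if sd == 0:
        # Degenerate case: every shuffle scored the same (convention assumed here).
        return 0.0 if observed == mean else math.copysign(math.inf, observed - mean)
    return (observed - mean) / sd
```

Each Z-score requires `n_shuffles` extra alignments per sequence pair, which is why the method is far more compute-intensive than an e-value derived from a fitted score distribution.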

RESULTS

All experiments are performed on the ASTRAL SCOP database. The Smith-Waterman sequence comparison algorithm is evaluated with both e-value and Z-score statistics, using ROC, CVE and AP measures. The BLAST and FASTA algorithms are used as references. We find that two out of three Smith-Waterman implementations with e-value are better at predicting structural similarities between proteins than the Smith-Waterman implementation with Z-score. SSEARCH in particular scores very high.
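Measures of this kind are computed from a ranked hit list in which each hit is labeled as a true or false structural relationship. The sketch below assumes that AP stands for average precision and ROC for receiver operating characteristic (the abstract does not expand the acronyms) and uses common textbook definitions, which may differ in detail from those in the paper. The function names and the treatment of lists with fewer than `n` false positives are assumptions made for illustration.

```python
def average_precision(labels):
    """Mean of the precision values at the rank of each true hit.

    labels: booleans ordered from best-scoring hit to worst (True = related).
    """
    hits = 0
    precisions = []
    for rank, is_true in enumerate(labels, start=1):
        if is_true:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def roc_n(labels, n):
    """ROC_n: area under the ROC curve up to the n-th false positive, scaled to [0, 1]."""
    total_true = sum(labels)
    if total_true == 0 or n <= 0:
        return 0.0
    true_seen = 0
    false_seen = 0
    area = 0
    for is_true in labels:
        if is_true:
            true_seen += 1
        else:
            false_seen += 1
            area += true_seen  # true hits ranked ahead of this false positive
            if false_seen == n:
                break
    # Assumed convention: if the list holds fewer than n false positives,
    # the missing ones are treated as ranking below every listed true hit.
    area += (n - false_seen) * true_seen
    return area / (n * total_true)
```

For example, the ranking `[True, True, False, True, False]` gives an average precision of (1 + 1 + 0.75) / 3 and a ROC_2 of 5 / 6.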

CONCLUSION

The compute-intensive Z-score does not have a clear advantage over the e-value. The Smith-Waterman implementations generally give better results than their heuristic counterparts. We recommend using the SSEARCH algorithm combined with e-values for pairwise sequence comparisons.
