Hulsen Tim, de Vlieg Jacob, Leunissen Jack A M, Groenen Peter M A
Centre for Molecular and Biomolecular Informatics, Nijmegen Centre for Molecular Life Sciences, Radboud University Nijmegen Medical Centre, Nijmegen, The Netherlands.
BMC Bioinformatics. 2006 Oct 12;7:444. doi: 10.1186/1471-2105-7-444.
In recent years the Smith-Waterman sequence comparison algorithm has gained popularity due to improved implementations and rapidly increasing computing power. However, the quality and sensitivity of a database search are determined not only by the algorithm but also by the statistical significance testing applied to an alignment. The e-value is the most commonly used statistical validation method for sequence database searching. The CluSTr database and the Protein World database have been created using an alternative statistical significance test: a Z-score based on Monte-Carlo statistics. Several papers have described the superiority of the Z-score over the e-value, using simulated data. We were interested in whether this could be validated when applied to existing, evolutionarily related protein sequences.
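As an illustration of what a Monte-Carlo Z-score involves, the sketch below scores a pair of sequences, rescores the pair after repeatedly shuffling one of them, and expresses the observed score in standard deviations above the shuffled mean. This is a minimal sketch under stated assumptions, not the procedure used by CluSTr, Protein World, or any tool evaluated here: the function names `sw_score` and `z_score`, the simple match/mismatch scoring with a linear gap penalty, and the choice of 100 shuffles are all illustrative.

```python
import random
import statistics


def sw_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score (Smith-Waterman, linear gap penalty).

    The scoring parameters are illustrative; real protein searches use a
    substitution matrix and affine gap penalties.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            # Local alignment: a cell never drops below zero.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def z_score(a, b, n_shuffles=100, seed=0):
    """Monte-Carlo Z-score: (observed - mean of shuffled) / sd of shuffled."""
    rng = random.Random(seed)
    observed = sw_score(a, b)
    letters = list(b)
    null_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)  # keeps composition, destroys residue order
        null_scores.append(sw_score(a, "".join(letters)))
    mean = statistics.mean(null_scores)
    sd = statistics.pstdev(null_scores)
    if sd == 0:
        return 0.0  # no spread in the shuffled scores; Z is undefined here
    return (observed - mean) / sd
```

Each Z-score requires `n_shuffles` additional alignments per sequence pair, which is why this statistic is far more expensive to compute than an e-value derived from a fitted score distribution.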
All experiments are performed on the ASTRAL SCOP database. The Smith-Waterman sequence comparison algorithm with both e-value and Z-score statistics is evaluated, using ROC, CVE and AP measures. The BLAST and FASTA algorithms are used as reference. We find that two out of three Smith-Waterman implementations with e-value are better at predicting structural similarities between proteins than the Smith-Waterman implementation with Z-score. SSEARCH especially has very high scores.
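To make the evaluation idea concrete, the sketch below ranks search hits by a significance statistic and computes the average precision of the ranking against known structural relationships. It assumes that AP refers to average precision and that each hit carries a true/false label (for example, from a shared SCOP classification); the abstract does not give the exact definitions used in the study, and the names `rank_labels` and `average_precision` are hypothetical.

```python
def rank_labels(hits, lower_is_better):
    """Order hits from most to least significant and return their labels.

    `hits` is a list of (statistic, is_true_relationship) pairs.
    Use lower_is_better=True for e-values and False for Z-scores.
    """
    ordered = sorted(hits, key=lambda h: h[0], reverse=not lower_is_better)
    return [is_true for _, is_true in ordered]


def average_precision(ranked_labels):
    """Mean of the precision values at the rank of each true hit.

    Normalised by the number of true hits present in the list, which is
    one common convention; other variants divide by all known positives.
    """
    hits = 0
    total = 0.0
    for rank, is_true in enumerate(ranked_labels, start=1):
        if is_true:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


# Illustrative values only, not data from the study.
evalue_hits = [(1e-30, True), (1e-5, False), (1e-12, True)]
zscore_hits = [(45.0, True), (6.0, False), (9.5, True)]
ap_evalue = average_precision(rank_labels(evalue_hits, lower_is_better=True))
ap_zscore = average_precision(rank_labels(zscore_hits, lower_is_better=False))
```

Because e-values improve as they decrease while Z-scores improve as they increase, the ranking direction has to be set per statistic before the two can be compared with the same measure.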
The compute-intensive Z-score does not have a clear advantage over the e-value. The Smith-Waterman implementations generally give better results than their heuristic counterparts. We recommend using the SSEARCH algorithm combined with e-values for pairwise sequence comparisons.