Agrawal Ankit, Huang Xiaoqiu
Department of Computer Science, Iowa State University, 226 Atanasoff Hall, Ames, IA 50011-1041, USA.
BMC Bioinformatics. 2009 Mar 19;10 Suppl 3(Suppl 3):S1. doi: 10.1186/1471-2105-10-S3-S1.
Accurate estimation of the statistical significance of a pairwise alignment is an important problem in sequence comparison. Recently, a comparative study of pairwise statistical significance with database statistical significance was conducted. In this paper, we extend the earlier work on pairwise statistical significance by incorporating the use of multiple parameter sets.
Results for a knowledge discovery application of homology detection reveal that using multiple parameter sets for pairwise statistical significance estimates gives better coverage than using a single parameter set, at least at some error levels. Further, pairwise statistical significance using multiple parameter sets is shown to be significantly better than the database statistical significance estimates reported by BLAST and PSI-BLAST, and comparable to, and at times significantly better than, those of SSEARCH. Using non-zero parameter set change penalty values gives better performance than using a zero penalty.
The fact that homology detection performance does not degrade when using multiple parameter sets is strong evidence for the validity of the assumption that the alignment score distribution follows an extreme value distribution even when multiple parameter sets are used. The parameter set change penalty is a useful parameter for alignment using multiple parameter sets. Pairwise statistical significance using multiple parameter sets can be used effectively to determine the relatedness of one or a few pairs of sequences without performing a time-consuming database search.
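To make the extreme value assumption concrete, the following is a minimal sketch of how a pairwise P-value could be computed once a set of null alignment scores is available, for example scores obtained by aligning one sequence against shuffled versions of the other. The function names, the use of shuffling to obtain the null scores, and the method-of-moments fit are illustrative assumptions; the abstract does not state which fitting procedure the authors use.

```python
import math
import statistics

# Euler-Mascheroni constant, used in the Gumbel mean: E[S] = mu + gamma * beta
EULER_GAMMA = 0.5772156649015329


def fit_gumbel_moments(null_scores):
    """Fit Gumbel (extreme value) location mu and scale beta by method of moments.

    Assumption: null_scores are alignment scores of unrelated (e.g. shuffled)
    sequence pairs. This is a simple stand-in for whatever fit the paper uses.
    """
    mean = statistics.fmean(null_scores)
    std = statistics.stdev(null_scores)
    beta = std * math.sqrt(6) / math.pi   # Var[S] = (pi^2 / 6) * beta^2
    mu = mean - EULER_GAMMA * beta
    return mu, beta


def gumbel_p_value(score, mu, beta):
    """Return P(S >= score) = 1 - exp(-exp(-(score - mu) / beta))."""
    z = (score - mu) / beta
    if z < -700:                          # exp(-z) would overflow; P is 1 here
        return 1.0
    return -math.expm1(-math.exp(-z))     # numerically stable 1 - exp(-y)


# Hypothetical null scores, for illustration only.
mu, beta = fit_gumbel_moments([30, 32, 35, 31, 40, 33, 36, 34])
p = gumbel_p_value(80, mu, beta)
```

A small P-value for the observed alignment score suggests the pair is related. Producing the alignment scores themselves, with one or several parameter sets, is outside the scope of this sketch.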