Department of Computer Science, Iowa State University, 226 Atanasoff Hall, Ames, IA 50011-1041, USA.
IEEE/ACM Trans Comput Biol Bioinform. 2011 Jan-Mar;8(1):194-205. doi: 10.1109/TCBB.2009.69.
Pairwise sequence alignment is a central problem in bioinformatics and forms the basis of many other applications. Two related sequences are expected to have a high alignment score, but relatedness is usually judged by statistical significance rather than by the alignment score itself. Recently, pairwise statistical significance was shown to give promising results as an alternative to database statistical significance for obtaining individual significance estimates of pairwise alignment scores. The improvement was mainly attributed to making the statistical significance estimation process more sequence-specific and database-independent. In this paper, we use sequence-specific and position-specific substitution matrices to derive estimates of pairwise statistical significance, an approach expected to bring more sequence-specific information into the estimation. We conducted experiments on a benchmark database using sequence-specific substitution matrices at different levels of sequence-specific contribution. The results confirm that estimating pairwise statistical significance with sequence-specific substitution matrices is significantly better than using a standard matrix such as BLOSUM62, and significantly better than the database statistical significance estimates reported by popular database search programs such as BLAST, PSI-BLAST (without pretrained PSSMs), and SSEARCH. With pretrained PSSMs, however, PSI-BLAST results are significantly better than those obtained with sequence-specific substitution matrices. Further, using position-specific substitution matrices for estimating pairwise statistical significance gives significantly better results even than PSI-BLAST using pretrained PSSMs.
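The idea of a pairwise, database-independent significance estimate can be illustrated with a small sketch. The abstract does not describe the estimation procedure, so everything below is an assumption made for illustration: the null score distribution is built by shuffling one sequence, a Gumbel (extreme value) distribution is fitted by the method of moments, and a toy match/mismatch scheme with a linear gap penalty stands in for a real substitution matrix such as BLOSUM62 or a sequence-specific matrix. The names `local_alignment_score` and `pairwise_p_value` are hypothetical.

```python
import math
import random
import statistics

EULER_GAMMA = 0.5772156649015329


def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman local alignment score with a linear gap penalty.

    Assumption: a toy match/mismatch scheme replaces a real substitution matrix.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def pairwise_p_value(a, b, n_shuffles=200, seed=0):
    """Estimate a P-value for the alignment score of this specific pair.

    Assumption: the null distribution comes from aligning `a` against
    shuffled copies of `b`, and is summarized by a Gumbel distribution
    fitted with the method of moments.
    """
    rng = random.Random(seed)
    observed = local_alignment_score(a, b)

    letters = list(b)
    null_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)
        null_scores.append(local_alignment_score(a, "".join(letters)))

    mean = statistics.fmean(null_scores)
    std = statistics.pstdev(null_scores)
    if std == 0:
        # Degenerate null distribution: fall back to a crude answer.
        return observed, (1.0 if observed <= mean else 0.0)

    # Method-of-moments Gumbel parameters (scale beta, location mu).
    beta = std * math.sqrt(6) / math.pi
    mu = mean - EULER_GAMMA * beta

    # P(S >= observed) = 1 - exp(-exp(-z)); clamp z to avoid overflow.
    z = max((observed - mu) / beta, -50.0)
    p_value = -math.expm1(-math.exp(-z))
    return observed, p_value


if __name__ == "__main__":
    seq1 = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVV"
    seq2 = "MKTAYLAKQRQISFVKSHFSRQIEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVRVKALPDAQFEVV"
    score, p = pairwise_p_value(seq1, seq2)
    print(f"score={score}, estimated P-value={p:.3g}")
```

Because the null distribution is generated from the two sequences being compared, the estimate depends only on that pair and the scoring scheme, not on the size or composition of a database. Swapping the toy scoring for a lookup into a substitution matrix would be the natural place to plug in a sequence-specific or position-specific matrix.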