Stockholm Bioinformatics Center, Stockholm University, SE-10691 Stockholm, Sweden.
Bioinformatics. 2009 Oct 1;25(19):2500-5. doi: 10.1093/bioinformatics/btp446. Epub 2009 Jul 20.
Low-complexity sequence regions are a common problem when searching for true homologs of a protein query sequence. Several solutions have been suggested, but a detailed comparison of them on challenging data has so far been lacking. A common benchmark for homology detection procedures uses SCOP/ASTRAL domain sequences belonging to the same or different superfamilies, but these contain almost no low-complexity sequences.
We here introduce an alternative benchmarking strategy based on Pfam domains and clans in whole-proteome data sets, which gives a realistic level of low-complexity sequences. We used it to evaluate all six built-in BLAST low-complexity filter settings as well as a range of settings in the MSPcrunch post-processing filter. The effect on alignment length was also assessed.
Score matrix adjustment methods provide a low false positive rate at a relatively small loss in sensitivity compared with no filtering, across the range of test conditions we applied. MSPcrunch lost even less sensitivity, but at a higher false positive rate. A drawback of the score matrix adjustment methods, however, is that the alignments often become truncated.
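To make the sensitivity and false positive measures concrete, the following is a minimal sketch of how hits might be scored in a clan-based benchmark. The abstract does not give the exact definitions, so the rules here are assumptions: a hit counts as true when query and subject domains fall in the same Pfam clan (a family without a clan is treated as its own clan), and false otherwise. The function name `score_hits`, the mapping `clan_of`, and the toy identifiers are hypothetical.

```python
from typing import Dict, Iterable, Tuple


def score_hits(
    hits: Iterable[Tuple[str, str]],
    clan_of: Dict[str, str],
    n_true_pairs: int,
) -> Tuple[int, int, float, float]:
    """Classify hits by clan membership and summarize them.

    hits         -- (query_family, subject_family) pairs reported by the search
    clan_of      -- family -> clan mapping; families absent from the mapping
                    are treated as their own clan (an assumption)
    n_true_pairs -- number of homologous pairs the search could have found
    """
    tp = fp = 0
    for query_family, subject_family in hits:
        query_clan = clan_of.get(query_family, query_family)
        subject_clan = clan_of.get(subject_family, subject_family)
        if query_clan == subject_clan:
            tp += 1  # same clan: counted as a true homolog
        else:
            fp += 1  # different clans: counted as a false positive
    sensitivity = tp / n_true_pairs if n_true_pairs else 0.0
    fp_fraction = fp / (tp + fp) if (tp + fp) else 0.0
    return tp, fp, sensitivity, fp_fraction


# Toy example with made-up family and clan names.
toy_clans = {"famA": "clan1", "famB": "clan1", "famC": "clan2"}
toy_hits = [("famA", "famB"), ("famA", "famC"), ("famD", "famD")]
print(score_hits(toy_hits, toy_clans, n_true_pairs=4))
```

Running the same scoring on the hit lists produced under each filter setting would allow the kind of comparison described above, with the loss in sensitivity weighed against the reduction in false positives.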
Perl scripts for MSPcrunch BLAST filtering and for generating the benchmark dataset are available at http://sonnhammer.sbc.su.se/download/software/MSPcrunch+Blixem/benchmark.tar.gz