Morgulis Aleksandr, Gertz E Michael, Schäffer Alejandro A, Agarwala Richa
National Center for Biotechnology Information, National Institutes of Health, Department of Health and Human Services, Bethesda, Maryland 20894 USA.
J Comput Biol. 2006 Jun;13(5):1028-40. doi: 10.1089/cmb.2006.13.1028.
The DUST module has been used within BLAST for many years to mask low-complexity sequences. In this paper, we present a new implementation of the DUST module that uses the same function to assign a complexity score to a sequence, but uses a different rule by which high-scoring sequences are masked. The new rule masks every nucleotide masked by the old rule and occasionally masks more. The new masking rule corrects two related deficiencies with the old rule. First, the new rule is symmetric with respect to reversing the sequence. Second, the new rule is not context sensitive; the decision to mask a subsequence does not depend on what sequences flank it. The new implementation is at least four times faster than the old on the human genome. We show that both the percentage of additional bases masked and the effect on MegaBLAST outputs are very small.
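The abstract refers to a complexity score function without stating it. As a minimal sketch, assuming the triplet-count formulation commonly given for DUST (sum of c(c-1)/2 over the counts c of each distinct 3-mer, divided by the number of triplets minus one), the score could be computed as follows. The function name `dust_score` is hypothetical and this is not the NCBI implementation; it only illustrates the kind of function that both the old and new masking rules would share.

```python
from collections import Counter


def dust_score(seq: str) -> float:
    """Triplet-based complexity score (assumed formulation, not NCBI code).

    Higher values mean more repeated triplets, i.e. lower complexity.
    """
    seq = seq.upper()
    n_triplets = len(seq) - 2
    if n_triplets < 2:
        # Too short to contain a repeated triplet; treat as score 0.
        return 0.0
    # Count every overlapping 3-mer in the sequence.
    counts = Counter(seq[i:i + 3] for i in range(n_triplets))
    # Each triplet type seen c times contributes c*(c-1)/2 repeated pairs.
    repeated_pairs = sum(c * (c - 1) // 2 for c in counts.values())
    return repeated_pairs / (n_triplets - 1)
```

Under this formulation the score of a sequence equals the score of its reversal, because reversing the sequence only reverses each triplet and leaves the multiset of counts unchanged. The asymmetry that the abstract attributes to the old implementation would therefore lie in the masking rule rather than in the score.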