Computational Biology Research Center, Institute for Advanced Industrial Science and Technology, Sequence Analysis Team, 2-4-7 Aomi, Koto-ku, Tokyo 135-0064, Japan.
Nucleic Acids Res. 2011 Mar;39(4):e23. doi: 10.1093/nar/gkq1212. Epub 2010 Nov 24.
Biological sequences are often analyzed by detecting homologous regions between them. Homology search is confounded by simple repeats, which give rise to strong similarities that are not homologies. Standard repeat-masking methods fail to eliminate this problem, and they are especially ill-suited to AT-rich DNA such as malaria and slime-mould genomes. We present a new repeat-masking method, TANTAN, which is motivated by the mechanisms that create simple repeats. This method thoroughly eliminates spurious homology predictions for DNA-DNA, protein-protein and DNA-protein comparisons. Moreover, it enables accurate homology search for non-coding DNA with extreme A + T composition.