Department of Bioinformatics, Institute of Microbiology and Genetics.
Center for Computational Sciences, University of Goettingen, Goettingen, Germany.
Bioinformatics. 2019 Jan 15;35(2):211-218. doi: 10.1093/bioinformatics/bty592.
Most methods for pairwise and multiple genome alignment use fast local homology search tools to identify anchor points, i.e. high-scoring local alignments of the input sequences. Sequence segments between those anchor points are then aligned with slower, more sensitive methods. Finding suitable anchor points is therefore crucial for genome sequence comparison; speed and sensitivity of genome alignment depend on the underlying anchoring methods.
In this article, we use filtered spaced-word matches to generate anchor points for genome alignment. For a given binary pattern representing match and don't-care positions, we first search for spaced-word matches, i.e. ungapped local pairwise alignments with matching nucleotides at the match positions of the pattern and possible mismatches at the don't-care positions. Spaced-word matches whose similarity scores exceed a threshold value are then extended using a standard X-drop algorithm; the resulting local alignments are used as anchor points. To evaluate this approach, we used the popular multiple-genome-alignment pipeline Mugsy and replaced the exact word matches that Mugsy uses as anchor points with our spaced-word-based anchor points. For closely related genome sequences, the two anchoring procedures lead to multiple alignments of similar quality. For distantly related genomes, however, alignments calculated with our filtered spaced-word matches are superior to alignments produced with the original Mugsy program, which uses exact word matches to find anchor points.
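The three steps described above (spaced-word matching, score filtering, X-drop extension) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the +1/-1 match/mismatch scores, scoring over the whole pattern window, and the ungapped extension are assumptions made for illustration only, and the abstract does not specify the scoring scheme or threshold actually used.

```python
from collections import defaultdict


def spaced_words(seq, pattern):
    """Yield (word, pos): the characters of seq at the match ('1') positions
    of the binary pattern, for every window start position pos."""
    match_pos = [i for i, c in enumerate(pattern) if c == "1"]
    for pos in range(len(seq) - len(pattern) + 1):
        yield "".join(seq[pos + i] for i in match_pos), pos


def window_score(a, b, match=1, mismatch=-1):
    """Score two equal-length windows (assumed +1/-1 scheme, hypothetical)."""
    return sum(match if x == y else mismatch for x, y in zip(a, b))


def spaced_word_matches(s1, s2, pattern, threshold):
    """Find window pairs that agree at all match positions and keep those
    whose window score is at least `threshold` (the filtering step)."""
    index = defaultdict(list)
    for word, pos in spaced_words(s1, pattern):
        index[word].append(pos)
    length = len(pattern)
    hits = []
    for word, j in spaced_words(s2, pattern):
        for i in index.get(word, []):
            score = window_score(s1[i:i + length], s2[j:j + length])
            if score >= threshold:
                hits.append((i, j, score))
    return hits


def xdrop_extend(s1, s2, i, j, step, x_drop, match=1, mismatch=-1):
    """Ungapped extension from (i, j) in direction `step` (+1 or -1).
    Stops when the running score falls more than x_drop below the best
    score seen; returns (length, score) of the best-scoring extension."""
    score = best = best_len = 0
    k = 0
    while 0 <= i + step * k < len(s1) and 0 <= j + step * k < len(s2):
        score += match if s1[i + step * k] == s2[j + step * k] else mismatch
        k += 1
        if score > best:
            best, best_len = score, k
        elif best - score > x_drop:
            break
    return best_len, best


def anchor_from_hit(s1, s2, hit, pattern, x_drop):
    """Extend a filtered spaced-word match in both directions and return
    (start1, start2, length, score) of the resulting local alignment."""
    i, j, score = hit
    length = len(pattern)
    right_len, right_score = xdrop_extend(s1, s2, i + length, j + length, 1, x_drop)
    left_len, left_score = xdrop_extend(s1, s2, i - 1, j - 1, -1, x_drop)
    return (i - left_len, j - left_len,
            length + left_len + right_len,
            score + left_score + right_score)
```

A call such as `spaced_word_matches(s1, s2, "1101", threshold)` returns the filtered matches, and each one can be passed to `anchor_from_hit` to obtain a candidate anchor point. A hash index over one sequence is used here for simplicity; it is a design choice of this sketch rather than something stated in the abstract.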
http://spacedanchor.gobics.de.
Supplementary data are available at Bioinformatics online.