Olson Daniel, Wheeler Travis
University of Montana, Missoula, Montana
ACM BCB. 2018 Aug-Sep;2018:37-46. doi: 10.1145/3233547.3233604.
In biological sequences, tandem repeats consist of tens to hundreds of residues of a repeated pattern, such as atgatgatgatgatg ('atg' repeated), and are often the result of replication slippage. Over time, these repeats decay, so the original sharp pattern of repetition becomes partly obscured. Even degenerate repeats pose a problem for sequence annotation: when two sequences both contain similar patterns of repetition, the result can be a false signal of sequence homology. We describe an implementation of a new hidden Markov model for detecting tandem repeats that shows substantially improved sensitivity in labeling decayed repetitive regions, presents low and reliable false annotation rates across a wide range of sequence composition, and produces scores that follow a stable distribution. On typical genomic sequence, the time and memory requirements of the resulting tool are competitive with those of the most heavily used tool for repeat masking. The tool is released under an open source license and lays the groundwork for inclusion of the model in sequence alignment tools and annotation pipelines.
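To make the notion of a decaying repeat concrete, the following minimal Python sketch scores how strongly a sequence repeats at a given period. It is not the hidden Markov model described in the abstract; it is a simple lag-match heuristic swapped in for illustration, and the function names `lag_match_fraction` and `best_period` are hypothetical.

```python
from typing import Tuple


def lag_match_fraction(seq: str, period: int) -> float:
    """Fraction of positions whose residue equals the residue `period` steps later."""
    if period <= 0 or period >= len(seq):
        raise ValueError("period must be between 1 and len(seq) - 1")
    pairs = len(seq) - period
    matches = sum(1 for i in range(pairs) if seq[i] == seq[i + period])
    return matches / pairs


def best_period(seq: str, max_period: int = 10) -> Tuple[int, float]:
    """Return the smallest period with the highest lag-match fraction.

    Illustrative heuristic only (an assumption, not the paper's method).
    """
    seq = seq.lower()
    limit = min(max_period, len(seq) - 1)
    # Periods are inserted in ascending order, so max() returns the
    # smallest period among those tied for the top score.
    scores = {p: lag_match_fraction(seq, p) for p in range(1, limit + 1)}
    period = max(scores, key=scores.get)
    return period, scores[period]


if __name__ == "__main__":
    print(best_period("atgatgatgatgatg"))            # perfect 'atg' repeat
    print(lag_match_fraction("atgatgacgatgatg", 3))  # one substitution
```

For the perfect repeat from the abstract, period 3 matches at every position. A single substitution already lowers the period-3 score, which hints at why heavily decayed repeats call for a probabilistic model rather than exact matching.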