IEEE/ACM Trans Comput Biol Bioinform. 2022 Jul-Aug;19(4):2080-2091. doi: 10.1109/TCBB.2021.3059239. Epub 2022 Aug 8.
Tandem repeats are repetitive structures present in some DNA sequences, consisting of many repeated copies of a single motif. They can serve as important markers for phylogenetic and population genetic studies, owing to the high polymorphism in the number of motif copies as well as variations within the motif. The first step in using tandem repeats for phylogenetic studies is to estimate the evolutionary distance between a pair D1 and D2 of tandem repeat sequences with homologous motifs. This problem can be broken into two sub-problems: (1) construct the most recent common ancestor of the two sequences, and (2) calculate the evolutionary distance between each sequence and the hypothesised common ancestor. We present an algorithm that estimates a solution to the second sub-problem. It takes the form of an asymmetric alignment algorithm that estimates the evolutionary distance between two tandem repeat sequences A and D, where D is assumed to have descended from A, under a model that allows block duplication, deletion, and variant substitution. The algorithm is asymmetric in the sense that the two input sequences A and D play different roles in the calculation, reflecting the assumption that D descends from A. Our model assumes static motif boundaries, meaning that motif duplication and deletion events must respect those boundaries. The algorithm may also be applied without modification to more complex repetitive structures with two or more motifs, such as nested tandem repeats.
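To make the idea of an asymmetric alignment concrete, the following is a minimal illustrative sketch and not the algorithm of the paper. It assumes that each sequence is already split into motif copies (consistent with static motif boundaries), that only single-motif duplications and deletions occur rather than general block events, and that substitution cost is a simple mismatch count. The function names `hamming` and `asymmetric_distance` and the cost parameters `sub_cost`, `dup_cost`, and `del_cost` are hypothetical and chosen for illustration only.

```python
from typing import List

INF = float("inf")


def hamming(x: str, y: str) -> int:
    """Count mismatching positions; any length difference counts as mismatches."""
    return sum(a != b for a, b in zip(x, y)) + abs(len(x) - len(y))


def asymmetric_distance(
    ancestor: List[str],
    descendant: List[str],
    sub_cost: float = 1.0,   # assumed cost per substituted base
    dup_cost: float = 1.0,   # assumed cost of duplicating one motif copy
    del_cost: float = 1.0,   # assumed cost of deleting one motif copy
) -> float:
    """Estimate the cost of deriving `descendant` from `ancestor`.

    dp[i][j] is the cheapest way to explain the first j descendant motifs
    from the first i ancestor motifs. The roles are not interchangeable:
    ancestor motifs may be deleted, while an extra descendant motif must
    be a (possibly mutated) duplicate of an ancestor motif.
    """
    n, m = len(ancestor), len(descendant)
    dp = [[INF] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    # An empty descendant prefix is explained by deleting ancestor motifs.
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] + del_cost
    # dp[0][j] stays infinite for j > 0: nothing can arise from no ancestor.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = sub_cost * hamming(ancestor[i - 1], descendant[j - 1])
            inherit = dp[i - 1][j - 1] + s        # motif passed down, maybe mutated
            delete = dp[i - 1][j] + del_cost      # ancestor motif lost
            duplicate = dp[i][j - 1] + dup_cost + s  # extra copy of ancestor motif
            dp[i][j] = min(inherit, delete, duplicate)
    return dp[n][m]


if __name__ == "__main__":
    A = ["ACGT", "ACGA"]
    D = ["ACGT", "ACGT", "ACGA"]
    print(asymmetric_distance(A, D, del_cost=2.0))
    print(asymmetric_distance(D, A, del_cost=2.0))
```

With duplication cheaper than deletion, swapping the two arguments changes the result, which is the sense in which the ancestor and the descendant play different roles in this sketch.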