Kawahara Riki, Morishita Shinichi
Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, Chiba 277-8562, Japan.
Bioinformatics. 2025 Mar 29;41(4). doi: 10.1093/bioinformatics/btaf155.
Extended tandem repeats (TRs) have been associated with 60 or more diseases over the past 30 years. Although most TRs have single repeat units (or motifs), complex TRs with different units have recently been correlated with some brain disorders. Of note, a population-scale analysis shows that complex TRs at one locus can be divergent, and different units are often expanded between individuals. To understand the evolution of high TR diversity, it is informative to visualize a phylogenetic tree. To do this, we need to measure the edit distance between pairs of complex TRs by considering duplication and contraction of units created by replication slippage. However, traditional rigorous algorithms for this purpose are computationally expensive.
We here propose an efficient heuristic algorithm to estimate the edit distance with duplication and contraction of units (EDDC, for short). We select a set of frequent units that occur in given complex TRs, encode each unit as a single symbol, compress a TR into an optimal series of unit symbols that partially matches the original TR with the minimum Levenshtein distance, and estimate the EDDC between a pair of complex TRs from their compressed forms. Using substantial synthetic benchmark datasets, we demonstrate that the estimated EDDC is highly correlated with the accurate EDDC, with a Pearson correlation coefficient of >0.983, while the heuristic algorithm achieves orders of magnitude performance speedup.
The software program hEDDC that implements the proposed algorithm is available at https://github.com/Ricky-pon/hEDDC (DOI: 10.5281/zenodo.14732958).
在过去30年中,扩展串联重复序列(TRs)已与60多种疾病相关联。尽管大多数TRs具有单个重复单元(或基序),但具有不同单元的复杂TRs最近已与某些脑部疾病相关。值得注意的是,一项群体规模分析表明,一个位点的复杂TRs可能存在差异,并且不同单元在个体之间经常会发生扩展。为了理解高TR多样性的进化,可视化系统发育树会很有帮助。为此,我们需要通过考虑复制滑动产生的单元的重复和收缩来测量成对复杂TRs之间的编辑距离。然而,用于此目的的传统严格算法在计算上成本很高。
我们在此提出一种高效的启发式算法,用于估计单元重复和收缩情况下的编辑距离(简称为EDDC)。我们选择一组在给定复杂TRs中出现的频繁单元,将每个单元编码为单个符号,将一个TR压缩为一系列最优的单元符号,这些符号与原始TR部分匹配且具有最小的莱文斯坦距离,并根据其压缩形式估计一对复杂TRs之间的EDDC。使用大量合成基准数据集,我们证明估计的EDDC与准确的EDDC高度相关,皮尔逊相关系数>0.983,而该启发式算法实现了数量级的性能加速。
实现所提出算法的软件程序hEDDC可在https://github.com/Ricky-pon/hEDDC(DOI:10.5281/zenodo.14732958)获得。