Graduate Program in Bioinformatics and Systems Biology, University of California, San Diego, La Jolla, CA, USA.
Department of Computer Science and Engineering, University of California, San Diego, La Jolla, CA, USA.
Nat Methods. 2023 Sep;20(9):1346-1354. doi: 10.1038/s41592-023-01970-4. Epub 2023 Aug 14.
Although recent advances in 'complete genomics' have revealed previously inaccessible genomic regions, analysis of variation in centromeres and other extra-long tandem repeats (ETRs) faces an algorithmic challenge: there are currently no tools for accurate sequence comparison of ETRs. Counterintuitively, classical alignment approaches, such as the Smith-Waterman algorithm, fail to construct biologically adequate alignments of ETRs. We present UniAligner, a parameter-free sequence alignment algorithm whose sequence-dependent alignment scoring adapts automatically to each pair of compared sequences. UniAligner prioritizes matches of rare substrings, which are more likely to be relevant to the evolutionary relationship between two sequences. We apply UniAligner to estimate mutation rates in human centromeres and to quantify the extremely high rate of large duplications and deletions in centromeres. This high rate suggests that centromeres may represent some of the most rapidly evolving regions of the human genome with respect to their structural organization.
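The idea of prioritizing rare substrings can be illustrated with a toy sketch. The code below is not UniAligner's scoring scheme, which the abstract does not specify. It takes the extreme case of rarity, assuming that only k-mers occurring exactly once in each sequence serve as anchors, and then keeps the longest co-linear chain of those anchors. The function names (`kmer_positions`, `rare_anchors`, `colinear_chain`) and the parameter `k` are hypothetical and introduced only for illustration.

```python
from bisect import bisect_left


def kmer_positions(seq, k):
    """Map each k-mer to the list of its start positions in seq."""
    positions = {}
    for i in range(len(seq) - k + 1):
        positions.setdefault(seq[i:i + k], []).append(i)
    return positions


def rare_anchors(a, b, k):
    """Return sorted (pos_a, pos_b) pairs for k-mers unique in both sequences.

    Assumption for this sketch: "rare" means "occurs exactly once".
    Frequent k-mers (the bulk of a tandem repeat) are ignored entirely.
    """
    pos_a = kmer_positions(a, k)
    pos_b = kmer_positions(b, k)
    anchors = []
    for kmer, pa in pos_a.items():
        pb = pos_b.get(kmer)
        if pb is not None and len(pa) == 1 and len(pb) == 1:
            anchors.append((pa[0], pb[0]))
    return sorted(anchors)


def colinear_chain(anchors):
    """Longest chain of anchors strictly increasing in both coordinates.

    Expects anchors with distinct first coordinates (true for rare_anchors).
    Uses the patience-sorting form of longest increasing subsequence.
    """
    anchors = sorted(anchors)
    tails = []    # tails[r]: index of the anchor ending the best chain of length r + 1
    tail_js = []  # second coordinates of those anchors, kept sorted for bisect
    prev = [None] * len(anchors)
    for idx, (_, j) in enumerate(anchors):
        r = bisect_left(tail_js, j)
        prev[idx] = tails[r - 1] if r > 0 else None
        if r == len(tails):
            tails.append(idx)
            tail_js.append(j)
        else:
            tails[r] = idx
            tail_js[r] = j
    # Walk back from the end of the longest chain.
    chain = []
    idx = tails[-1] if tails else None
    while idx is not None:
        chain.append(anchors[idx])
        idx = prev[idx]
    return chain[::-1]
```

For example, calling `colinear_chain(rare_anchors("AAAAGCTAAAA", "AAAAAGCTAAA", 3))` anchors the two sequences on the k-mers spanning the single `GCT` insertion and ignores the repeated `AAA` k-mers, which match in many places and therefore say little about which positions correspond.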