Al-Abri Rashid, Gürsoy Gamze
Department of Computer Science, Columbia University, New York, USA.
New York Genome Center, New York, USA.
bioRxiv. 2025 Feb 20:2025.02.15.638440. doi: 10.1101/2025.02.15.638440.
Tandem repeats (TRs) are DNA sequences in which two or more base pairs are repeated back-to-back at specific locations in the genome. Expansions of TRs are implicated in over 50 conditions, including Friedreich's ataxia, autism, and cancer. However, accurately measuring the copy number of TRs is challenging, especially when their expansions are larger than the fragment sizes used in standard short-read genome sequencing. Here we introduce ScatTR, a novel computational method that leverages a maximum likelihood framework to estimate the copy number of large TR expansions from short-read sequencing data. ScatTR calculates the likelihood of different alignments between sequencing reads and reference sequences that represent various TR lengths, and employs a Monte Carlo technique to find the best match. In simulated data, ScatTR outperforms state-of-the-art methods, particularly for TRs with longer motifs and those with lengths that greatly exceed typical sequencing fragment sizes. When applied to data from the 1000 Genomes Project, ScatTR detected potential large TR expansions that other methods missed, highlighting its ability to improve genome-wide characterization of TR variation. ScatTR is available at https://github.com/g2lab/scattr.