Danzi Matt C, Xu Isaac R L, Fazal Sarah, Dolzhenko Egor, Pellerin David, Weisburd Ben, Reuter Chloe, Sampson Jacinda, Folland Chiara, Wheeler Matthew, O'Donnell-Luria Anne, Wuchty Stefan, Ravenscroft Gianina, Eberle Michael A, Zuchner Stephan
Dr. John T. Macdonald Foundation Department of Human Genetics and John P. Hussman Institute for Human Genomics, University of Miami Miller School of Medicine, Miami, FL, USA.
Pacific Biosciences, Menlo Park, CA, USA.
bioRxiv. 2025 Jan 20:2025.01.06.631535. doi: 10.1101/2025.01.06.631535.
Tandem repeats are a highly polymorphic class of genomic variation that plays causal roles in rare diseases but is notoriously difficult to sequence using short-read techniques. Most previous studies profiling tandem repeats genome-wide have reduced the description of each locus to a single value: the length of the entire repetitive locus. Here we introduce a comprehensive database of 3.6 billion tandem repeat allele sequences from over one thousand individuals, generated using HiFi long-read sequencing. We show that the previously identified pathogenic loci are among the most variable tandem repeat loci in the genome when nucleotide-resolution sequence content is incorporated to measure the longest pure motif segment. More broadly, we introduce a novel measure, 'tandem repeat constraint', that assists in distinguishing potentially pathogenic from benign loci. Our approach of measuring variation as 'the length of the longest pure segment' successfully prioritizes pathogenic repeats within their previously published linkage regions. We also present evidence for two novel pathogenic repeat expansion candidates. In summary, this analysis significantly clarifies the potential for short tandem repeat pathogenicity at over 1.7 million tandem repeat loci and will aid the identification of disease-causing repeat expansions.
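To make the 'length of the longest pure segment' idea concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that a pure segment means uninterrupted back-to-back copies of one given motif, that length is counted in bases, and that the motif is supplied in a fixed orientation (rotations and reverse complements are not considered). The function name `longest_pure_segment` is hypothetical.

```python
def longest_pure_segment(allele: str, motif: str) -> int:
    """Length in bases of the longest run of uninterrupted motif copies.

    Illustrative assumption: a "pure" segment is a stretch of the allele
    made only of consecutive, exact copies of `motif`.
    """
    if not motif:
        raise ValueError("motif must be a non-empty string")
    allele = allele.upper()
    motif = motif.upper()
    k = len(motif)
    best = 0
    # Try every start position so that runs in any phase are considered.
    for start in range(len(allele) - k + 1):
        end = start
        # Extend the run one whole motif copy at a time.
        while allele.startswith(motif, end):
            end += k
        best = max(best, end - start)
    return best


# An allele with one CAA interruption: the total repeat length is 18 bases,
# but the two pure CAG runs are 6 and 9 bases long.
interrupted = "CAGCAG" + "CAA" + "CAGCAGCAG"
print(len(interrupted), longest_pure_segment(interrupted, "CAG"))
```

In this toy allele the whole-locus length and the longest pure segment disagree, which is the distinction the abstract draws between summarizing a locus by its total length and measuring its uninterrupted motif content.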