Institut de Biologie Computationnelle, Montpellier, France.
Institut de Génétique Moléculaire de Montpellier, University of Montpellier, CNRS, Montpellier, France.
Nat Commun. 2021 Jun 2;12(1):3297. doi: 10.1038/s41467-021-23143-7.
Using the Cap Analysis of Gene Expression (CAGE) technology, the FANTOM5 consortium provided one of the most comprehensive maps of transcription start sites (TSSs) in several species. Strikingly, ~72% of them could not be assigned to a specific gene and initiate at unconventional regions, outside promoters or enhancers. Here, we probe these unassigned TSSs and show that, in all species studied, a significant fraction of CAGE peaks initiate at microsatellites, also called short tandem repeats (STRs). To confirm this transcription, we develop Cap Trap RNA-seq, a technology that combines cap trapping with long-read MinION sequencing. We train sequence-based deep learning models able to predict CAGE signal at STRs with high accuracy. These models reveal that the sequences surrounding STRs are important not only for distinguishing STR classes, but also for predicting the level of transcription initiation. Importantly, genetic variants linked to human diseases are preferentially found at STRs with a high level of transcription initiation, supporting the biological and clinical relevance of transcription initiation at STRs. Together, our results extend the repertoire of non-coding transcription associated with DNA tandem repeats and add a further layer of complexity to STR polymorphism.