Department of Experimental and Clinical Medicine, University of Florence, Viale Pieraccini 6, Florence 50134, Italy.
European Molecular Biology Laboratory (EMBL), GeneCore, Meyerhofstraße 1, Heidelberg 69117, Germany.
Gigascience. 2020 Oct 7;9(10). doi: 10.1093/gigascience/giaa101.
Tandem repeat sequences are widespread in the human genome, and their expansions cause multiple repeat-mediated disorders. Genome-wide discovery approaches are needed to fully elucidate their roles in health and disease, but resolving tandem repeat variation accurately remains a challenging task. While traditional mapping-based approaches using short-read data have severe limitations in the size and type of tandem repeats they can resolve, recent third-generation sequencing technologies exhibit substantially higher sequencing error rates, which complicates repeat resolution.
We developed TRiCoLOR, a freely available tool for tandem repeat profiling using error-prone long reads from third-generation sequencing technologies. The method can identify repetitive regions in sequencing data without prior knowledge of their motifs or locations and resolve repeat multiplicity and period size in a haplotype-specific manner. The tool includes methods to interactively visualize the identified repeats and to trace their Mendelian consistency in pedigrees.
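To make the terms "period size" and "repeat multiplicity" concrete, the following minimal Python sketch finds exact tandem repeats in a single sequence and reports the motif, its start position, and its copy number. It is an illustration only and is not TRiCoLOR's algorithm: the function name `find_tandem_repeats` and its parameters are hypothetical, and exact regular-expression matching does not tolerate the sequencing errors that the tool is designed to handle.

```python
import re

def find_tandem_repeats(seq, max_period=6, min_copies=3):
    """Return sorted (start, motif, copies) tuples for exact tandem repeats.

    Hypothetical helper for illustration; assumes an uppercase ACGT string.
    """
    hits = []
    for period in range(1, max_period + 1):
        # A motif of length `period` followed by at least min_copies - 1 exact copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (period, min_copies - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            # Skip motifs that are themselves repeats of a shorter unit (e.g. "CAGCAG").
            if any(motif == motif[:k] * (period // k)
                   for k in range(1, period) if period % k == 0):
                continue
            copies = len(match.group(0)) // period  # repeat multiplicity
            hits.append((match.start(), motif, copies))
    return sorted(hits)

if __name__ == "__main__":
    # Five copies of the trinucleotide CAG flanked by two T bases on each side.
    print(find_tandem_repeats("TT" + "CAG" * 5 + "TT"))
```

Here the period size is the motif length (3 for `CAG`) and the multiplicity is the number of consecutive copies. Applying the same idea to error-prone long reads would require approximate matching or a consensus step first, which this sketch deliberately leaves out.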
TRiCoLOR demonstrates excellent performance and improved sensitivity and specificity compared with alternative tools on synthetic data. For real human whole-genome sequencing data, TRiCoLOR achieves high validation rates, suggesting its suitability to identify tandem repeat variation in personal genomes.