Department of Genetics and Genome Biology, University of Leicester, Leicester, United Kingdom.
Department of Population Health Sciences, University of Leicester, Leicester, United Kingdom.
PLoS One. 2024 Apr 1;19(4):e0300545. doi: 10.1371/journal.pone.0300545. eCollection 2024.
Short tandem repeat (STR) variation is an often-overlooked source of variation between genomes. STRs comprise about 3% of the human genome and are highly polymorphic. Some cause Mendelian disease, and others affect gene expression. Their contribution to common disease is not well understood, but recent software tools designed to genotype STRs from short-read sequencing data will help address this. Here, we compare software that genotypes common STRs and rarer STR expansions genome-wide, with the aim of applying these tools to population-scale genomes. Using short-read sequencing data from the Genome in a Bottle (GIAB) consortium and the 1000 Genomes Project, we compare performance in terms of sequence length, depth, computing resources needed, genotyping accuracy and number of STRs genotyped. To ensure broad applicability of our findings, we also measure genotyping performance against a set of genomes from clinical samples with known STR expansions, and against a set of STRs commonly used for forensic identification. We find that HipSTR, ExpansionHunter and GangSTR perform well in genotyping common STRs, including the CODIS 13 core STRs used for forensic analysis. GangSTR and ExpansionHunter outperform HipSTR on genotyping call rate and memory usage. ExpansionHunter denovo (EHdn), STRling and GangSTR outperform STRetch for detecting expanded STRs, and EHdn and STRling use considerably less processor time than GangSTR. Because the analysis is based on shared genomic sequence data provided by the GIAB consortium, new software approaches can in future be compared on a common set of data, helping researchers choose the software that best fulfils their needs.
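The abstract names genotyping call rate and genotyping accuracy as comparison criteria but does not define them. As a rough illustration only, the sketch below assumes one plausible reading: call rate as the fraction of loci with a non-missing genotype, and accuracy as concordance with a reference genotype at loci called in both sets. The function names and the repeat counts are hypothetical and are not taken from the paper or from any of the tools it compares.

```python
from typing import Dict, Optional, Tuple

# A diploid STR genotype as a pair of repeat counts; None marks a missing call.
Genotype = Tuple[int, int]
Calls = Dict[str, Optional[Genotype]]


def call_rate(calls: Calls) -> float:
    """Fraction of loci that received a non-missing genotype."""
    if not calls:
        return 0.0
    called = sum(1 for gt in calls.values() if gt is not None)
    return called / len(calls)


def concordance(calls: Calls, truth: Calls) -> float:
    """Fraction of loci called in both sets whose allele pairs match.

    Allele order is ignored, so (6, 9) and (9, 6) count as the same genotype.
    """
    shared = [
        locus
        for locus, gt in calls.items()
        if gt is not None and truth.get(locus) is not None
    ]
    if not shared:
        return 0.0
    matches = sum(
        1 for locus in shared if sorted(calls[locus]) == sorted(truth[locus])
    )
    return matches / len(shared)


# Made-up repeat counts at four loci, for illustration only.
calls: Calls = {
    "TH01": (6, 9),
    "D3S1358": (15, 16),
    "TPOX": None,  # no call made at this locus
    "FGA": (22, 21),
}
truth: Calls = {
    "TH01": (9, 6),
    "D3S1358": (15, 17),
    "TPOX": (8, 8),
    "FGA": (21, 22),
}

print(f"call rate:   {call_rate(calls):.2f}")
print(f"concordance: {concordance(calls, truth):.2f}")
```

Keeping the two measures separate matters when reading a comparison like this one: a tool can agree with the reference at every locus it calls while still leaving many loci uncalled, so neither number alone describes its performance.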