Murdoch Children's Research Institute, Royal Children's Hospital, Parkville, VIC, 3052, Australia.
Peter MacCallum Cancer Centre, 305 Grattan St, Melbourne, VIC, 3000, Australia.
F1000Res. 2020 Mar 23;9:200. doi: 10.12688/f1000research.22639.1. eCollection 2020.
Short tandem repeats are an important source of genetic variation. They are highly mutable, and repeat expansions are associated with dozens of human disorders, such as Huntington's disease and spinocerebellar ataxias. Technical advances in sequencing technology have made it possible to analyse these repeats at large scale; however, accurate genotyping is still a challenging task. We compared four different short tandem repeat genotyping tools on whole exome sequencing data to determine their genotyping performance and limits, which will aid other researchers in choosing a suitable tool and parameters for analysis. The analysis was performed on the Simons Simplex Collection dataset, where we used a novel method of evaluation in which accuracy is determined by the rate of homozygous calls on the X chromosome of male samples. In total we analysed 433 samples and around a million genotypes to evaluate the tools on whole exome sequencing data. All tools performed relatively well when genotyping repeats with motifs of 3-6 bp in length, and performance could be improved with coverage and quality score filtering. However, genotyping homopolymers was challenging for all tools, and a high error rate was present across different thresholds of coverage and quality scores. Interestingly, dinucleotide repeats displayed a high error rate as well, which was found to be mainly caused by the AC/TG repeats. Overall, LobSTR was able to make the most calls and was also the fastest tool, while RepeatSeq and HipSTR exhibited the lowest heterozygous error rate at low coverage. All tools have different strengths and weaknesses, and the choice may depend on the application. In this analysis we demonstrated the effect of using different filtering parameters and offered recommendations based on the trade-off between the best genotyping accuracy and the highest number of calls.
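The evaluation idea described above can be illustrated with a minimal sketch. It assumes that males carry a single X chromosome, so any heterozygous call at an X-linked repeat outside the pseudoautosomal regions is counted as a genotyping error. The `Call` record, the `het_error_rate` function, and the threshold names are hypothetical and chosen for illustration only; they are not taken from the study's code or from any of the tools.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class Call:
    """One diploid STR genotype call on chrX of a male sample (hypothetical record)."""
    allele_a: int    # length of the first called allele, in bp
    allele_b: int    # length of the second called allele, in bp
    coverage: int    # number of reads supporting the call
    quality: float   # tool-reported quality score, assumed to lie in [0, 1]


def het_error_rate(
    calls: Iterable[Call],
    min_coverage: int = 0,
    min_quality: float = 0.0,
) -> Tuple[Optional[float], int]:
    """Return (heterozygous error rate, number of calls kept after filtering).

    A call is treated as an error when its two alleles differ, because a
    male X-linked locus is expected to be genotyped as homozygous.
    The rate is None when no call passes the filters.
    """
    kept = [
        c for c in calls
        if c.coverage >= min_coverage and c.quality >= min_quality
    ]
    if not kept:
        return None, 0
    het = sum(1 for c in kept if c.allele_a != c.allele_b)
    return het / len(kept), len(kept)


# Made-up calls, used only to show the filtering trade-off.
example_calls = [
    Call(12, 12, coverage=30, quality=0.99),
    Call(12, 14, coverage=8, quality=0.60),
    Call(20, 20, coverage=15, quality=0.95),
    Call(9, 10, coverage=25, quality=0.90),
]

unfiltered = het_error_rate(example_calls)
coverage_filtered = het_error_rate(example_calls, min_coverage=10)
strict = het_error_rate(example_calls, min_coverage=10, min_quality=0.95)
```

Raising either threshold lowers the number of retained calls while (ideally) lowering the error rate, which is the accuracy versus call-count trade-off the abstract refers to.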