Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA.
Department of Data Science, Dana-Farber Cancer Institute, Boston, MA, USA.
Nat Biotechnol. 2022 May;40(5):672-680. doi: 10.1038/s41587-021-01158-1. Epub 2022 Feb 7.
The repetitive nature and complexity of some medically relevant genes pose a challenge for their accurate analysis in a clinical setting. The Genome in a Bottle Consortium has provided variant benchmark sets, but these exclude nearly 400 medically relevant genes because of their repetitiveness or polymorphic complexity. Here, we characterize 273 of these 395 challenging autosomal genes using a haplotype-resolved whole-genome assembly. For the HG002 genome, this curated benchmark reports over 17,000 single-nucleotide variations, 3,600 insertions and deletions and 200 structural variations on each of the human reference genomes GRCh37 and GRCh38. We show that false duplications in either GRCh37 or GRCh38 result in reference-specific missed variants for short- and long-read technologies in medically relevant genes, including CBS, CRYAA and KCNE1. When these false duplications are masked, variant recall can improve from 8% to 100%. Forming benchmarks from a haplotype-resolved whole-genome assembly may become a prototype for future benchmarks covering the whole genome.
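To make the recall figures above concrete, here is a minimal sketch of how variant recall is conventionally computed against a benchmark: the fraction of benchmark variants that a call set recovers. The function name `recall` and all counts below are hypothetical, chosen only so that the arithmetic reproduces an 8% to 100% change; they are not data from the study.

```python
def recall(true_positives: int, false_negatives: int) -> float:
    """Return the fraction of benchmark variants recovered by a call set."""
    total = true_positives + false_negatives
    if total == 0:
        raise ValueError("benchmark contains no variants")
    return true_positives / total


# Hypothetical counts for a single gene (illustration only, not study data).
benchmark_variants = 25
found_before_masking = 2    # assumed: most variants missed in the falsely duplicated region
found_after_masking = 25    # assumed: all variants recovered once the false copy is masked

recall_before = recall(found_before_masking,
                       benchmark_variants - found_before_masking)
recall_after = recall(found_after_masking,
                      benchmark_variants - found_after_masking)

print(f"before masking: {recall_before:.0%}, after masking: {recall_after:.0%}")
```

With these assumed counts, missed variants count as false negatives, so recall rises as soon as the previously missed benchmark variants are called.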