Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA 02114, USA; Program in Medical and Population Genetics and Stanley Center for Psychiatric Disorders, Broad Institute of Harvard and Massachusetts Institute of Technology, Cambridge, MA 02142, USA; Department of Neurology, Massachusetts General Hospital and Harvard Medical School, Boston, MA 02114, USA.
Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA 02114, USA; Program in Medical and Population Genetics and Stanley Center for Psychiatric Disorders, Broad Institute of Harvard and Massachusetts Institute of Technology, Cambridge, MA 02142, USA; Division of Medical Sciences, Harvard Medical School, Boston, MA 02115, USA.
Am J Hum Genet. 2021 May 6;108(5):919-928. doi: 10.1016/j.ajhg.2021.03.014. Epub 2021 Mar 30.
Virtually all genome sequencing efforts in national biobanks, complex and Mendelian disease programs, and medical genetic initiatives are reliant upon short-read whole-genome sequencing (srWGS), which presents challenges for the detection of structural variants (SVs) relative to emerging long-read WGS (lrWGS) technologies. Given this ubiquity of srWGS in large-scale genomics initiatives, we sought to establish expectations for routine SV detection from this data type by comparison with lrWGS assembly, as well as to quantify the genomic properties and added value of SVs uniquely accessible to each technology. Analyses from the Human Genome Structural Variation Consortium (HGSVC) of three families captured ~11,000 SVs per genome from srWGS and ~25,000 SVs per genome from lrWGS assembly. Detection power and precision for SV discovery varied dramatically by genomic context and variant class: 9.7% of the current GRCh38 reference is defined by segmental duplication (SD) and simple repeat (SR), yet 91.4% of deletions that were specifically discovered by lrWGS localized to these regions. Across the remaining 90.3% of reference sequence, we observed extremely high (93.8%) concordance between technologies for deletions in these datasets. In contrast, lrWGS was superior for detection of insertions across all genomic contexts. Given that non-SD/SR sequences encompass 95.9% of currently annotated disease-associated exons, improved sensitivity from lrWGS to discover novel pathogenic deletions in these currently interpretable genomic regions is likely to be incremental. However, these analyses highlight the considerable added value of assembly-based lrWGS to create new catalogs of insertions and transposable elements, as well as disease-associated repeat expansions in genomic sequences that were previously recalcitrant to routine assessment.
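The contrast between the 9.7% SD/SR share of the reference and the 91.4% share of lrWGS-specific deletions can be restated as a fold enrichment. The sketch below is illustrative arithmetic using only the two percentages quoted in the abstract. The function name `fold_enrichment` and the "density ratio" framing are assumptions introduced here, not metrics reported by the authors, and the calculation assumes deletions would otherwise be spread in proportion to sequence length.

```python
# Illustrative arithmetic only: inputs are the percentages quoted in the abstract.
# fold_enrichment is a hypothetical helper, not a metric from the paper.

def fold_enrichment(frac_variants_in_region: float, frac_genome_in_region: float) -> float:
    """Observed share of variants in a region divided by the region's share of the genome."""
    return frac_variants_in_region / frac_genome_in_region

SDSR_GENOME_FRACTION = 0.097      # share of GRCh38 defined by SD and SR
LR_SPECIFIC_DEL_IN_SDSR = 0.914   # share of lrWGS-specific deletions in SD/SR

# Enrichment inside SD/SR sequence.
in_sdsr = fold_enrichment(LR_SPECIFIC_DEL_IN_SDSR, SDSR_GENOME_FRACTION)

# Depletion across the remaining 90.3% of the reference.
outside_sdsr = fold_enrichment(1 - LR_SPECIFIC_DEL_IN_SDSR, 1 - SDSR_GENOME_FRACTION)

# Relative per-base density of lrWGS-specific deletions, SD/SR versus the rest.
density_ratio = in_sdsr / outside_sdsr

print(f"fold enrichment in SD/SR:      {in_sdsr:.2f}")
print(f"fold enrichment outside SD/SR: {outside_sdsr:.3f}")
print(f"per-base density ratio:        {density_ratio:.1f}")
```

Under these assumptions, lrWGS-specific deletions are roughly 9-fold over-represented in SD/SR sequence and roughly 10-fold under-represented elsewhere, which is consistent with the abstract's conclusion that the gain in deletion sensitivity outside SD/SR regions is likely to be incremental.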