Gustafson Jonas A, Gibson Sophia B, Damaraju Nikhita, Zalusky Miranda PG, Hoekzema Kendra, Twesigomwe David, Yang Lei, Snead Anthony A, Richmond Phillip A, De Coster Wouter, Olson Nathan D, Guarracino Andrea, Li Qiuhui, Miller Angela L, Goffena Joy, Anderson Zachery, Storz Sophie HR, Ward Sydney A, Sinha Maisha, Gonzaga-Jauregui Claudia, Clarke Wayne E, Basile Anna O, Corvelo André, Reeves Catherine, Helland Adrienne, Musunuri Rajeeva Lochan, Revsine Mahler, Patterson Karynne E, Paschal Cate R, Zakarian Christina, Goodwin Sara, Jensen Tanner D, Robb Esther, McCombie W Richard, Sedlazeck Fritz J, Zook Justin M, Montgomery Stephen B, Garrison Erik, Kolmogorov Mikhail, Schatz Michael C, McLaughlin Richard N, Dashnow Harriet, Zody Michael C, Loose Matt, Jain Miten, Eichler Evan E, Miller Danny E
Division of Genetic Medicine, Department of Pediatrics, University of Washington, Seattle, WA, USA.
Molecular and Cellular Biology Program, University of Washington, Seattle, WA, USA.
medRxiv. 2024 Mar 7:2024.03.05.24303792. doi: 10.1101/2024.03.05.24303792.
Less than half of individuals with a suspected Mendelian condition receive a precise molecular diagnosis after comprehensive clinical genetic testing. Improvements in data quality and costs have heightened interest in using long-read sequencing (LRS) to streamline clinical genomic testing, but the absence of control datasets for variant filtering and prioritization has made tertiary analysis of LRS data challenging. To address this, the 1000 Genomes Project ONT Sequencing Consortium aims to generate LRS data from at least 800 of the 1000 Genomes Project samples. Our goal is to use LRS to identify a broader spectrum of variation so we may improve our understanding of normal patterns of human variation. Here, we present data from analysis of the first 100 samples, representing all 5 superpopulations and 19 subpopulations. These samples, sequenced to an average depth of coverage of 37x and sequence read N50 of 54 kbp, have high concordance with previous studies for identifying single nucleotide and indel variants outside of homopolymer regions. Using multiple structural variant (SV) callers, we identify an average of 24,543 high-confidence SVs per genome, including shared and private SVs likely to disrupt gene function as well as pathogenic expansions within disease-associated repeats that were not detected using short reads. Evaluation of methylation signatures revealed expected patterns at known imprinted loci, samples with skewed X-inactivation patterns, and novel differentially methylated regions. All raw sequencing data, processed data, and summary statistics are publicly available, providing a valuable resource for the clinical genetics community to discover pathogenic SVs.
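The abstract reports two summary statistics, read N50 and average depth of coverage. As a minimal sketch of how such figures are commonly defined, and not the consortium's actual pipeline, the following computes both from a list of read lengths. The function names `read_n50` and `mean_depth`, the toy read lengths, and the toy genome size are hypothetical, and depth is approximated here as total sequenced bases divided by genome size, which assumes every base aligns.

```python
def read_n50(lengths):
    """Return the read N50: the length L such that reads of length >= L
    together contain at least half of all sequenced bases."""
    if not lengths:
        raise ValueError("at least one read length is required")
    total = sum(lengths)
    running = 0
    # Walk reads from longest to shortest until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


def mean_depth(lengths, genome_size):
    """Approximate average depth as total bases / genome size.
    Assumption: all sequenced bases map to the genome."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return sum(lengths) / genome_size


# Toy values chosen for illustration only; they are not from the study.
toy_lengths = [100, 80, 60, 40, 20]
toy_n50 = read_n50(toy_lengths)
toy_depth = mean_depth(toy_lengths, genome_size=100)
```

Under this definition, a read N50 of 54 kbp means that half of all sequenced bases sit in reads of 54 kbp or longer, which is a different quantity from the median read length.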