Kronenberg Zev, Nolan Cillian, Porubsky David, Mokveld Tom, Rowell William J, Lee Sangjin, Dolzhenko Egor, Chang Pi-Chuan, Holt James M, Saunders Christopher T, Olson Nathan D, Steely Cody J, McGee Sean, Guarracino Andrea, Koundinya Nidhi, Harvey William T, Watkins W Scott, Munson Katherine M, Hoekzema Kendra, Chua Khi Pin, Chen Xiao, Fanslow Cairbre, Lambert Christine, Dashnow Harriet, Garrison Erik, Smith Joshua D, Lansdorp Peter M, Zook Justin M, Carroll Andrew, Jorde Lynn B, Neklason Deborah W, Quinlan Aaron R, Eichler Evan E, Eberle Michael A
PacBio, Menlo Park, CA, USA.
Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA.
Nat Methods. 2025 Aug;22(8):1669-1676. doi: 10.1038/s41592-025-02750-y. Epub 2025 Aug 4.
Recent advances in genome sequencing have improved variant calling in complex regions of the human genome. However, it is difficult to quantify variant calling performance because existing standards often focus on specificity, neglecting completeness in difficult-to-analyze regions. To create a more comprehensive truth set, we used Mendelian inheritance in a large pedigree (CEPH-1463) to filter variants across PacBio high-fidelity (HiFi), Illumina and Oxford Nanopore Technologies platforms. This generated a variant map with over 4.7 million single-nucleotide variants, 767,795 insertions and deletions (indels), 537,486 tandem repeats and 24,315 structural variants, covering 2.77 Gb of the GRCh38 genome. This work adds ~200 Mb of high-confidence regions, including 8% more small variants, and introduces the first tandem repeat and structural variant truth sets for NA12878 and her family. As an example of the value of this improved benchmark, we retrained DeepVariant using these data to reduce genotyping errors by ~34%.
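The abstract does not spell out how Mendelian inheritance was used to filter variants, so the following is only a minimal sketch of the underlying idea, assuming a single parent-child trio, autosomal diploid genotypes, and alleles encoded as integer indices (0 for reference, 1 and up for alternates). The function names `is_mendelian_consistent` and `filter_sites` and the dictionary layout are hypothetical, and this is not the authors' pipeline, which works across a multi-generation pedigree.

```python
from itertools import product


def is_mendelian_consistent(child, mother, father):
    """Return True if the child's unordered genotype can be formed by
    taking one allele from the mother and one from the father.

    Each genotype is a pair of integer allele indices, e.g. (0, 1).
    Assumption: autosomal, diploid, no missing calls.
    """
    target = sorted(child)
    return any(sorted((m, f)) == target for m, f in product(mother, father))


def filter_sites(sites):
    """Keep only sites whose trio genotypes pass the consistency check.

    `sites` is a list of dicts with hypothetical keys
    'child', 'mother', and 'father'.
    """
    return [
        s for s in sites
        if is_mendelian_consistent(s["child"], s["mother"], s["father"])
    ]


if __name__ == "__main__":
    example_sites = [
        {"id": "site_a", "child": (0, 1), "mother": (0, 0), "father": (1, 1)},
        {"id": "site_b", "child": (1, 1), "mother": (0, 0), "father": (0, 1)},
    ]
    for site in filter_sites(example_sites):
        print(site["id"])
```

In this toy input, `site_b` is dropped because the mother carries no copy of allele 1, so a homozygous alternate child genotype cannot be explained by inheritance alone. A check like this removes many genotyping errors but also discards true de novo variants, which is one reason a pedigree-scale approach needs more than a per-trio test.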