Data Sciences Platform, Broad Institute of MIT and Harvard, Cambridge, MA, USA.
Program in Medical and Population Genetics and Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA, USA.
Nat Genet. 2023 Sep;55(9):1589-1597. doi: 10.1038/s41588-023-01449-0. Epub 2023 Aug 21.
Copy number variants (CNVs) are major contributors to genetic diversity and disease. While standardized methods, such as the Genome Analysis Toolkit (GATK), exist for detecting short variants, technical challenges have confounded uniform large-scale CNV analyses from whole-exome sequencing (WES) data. Given the profound impact of rare and de novo coding CNVs on genome organization and human disease, we developed GATK-gCNV, a flexible algorithm to discover rare CNVs from sequencing read-depth information, complete with open-source distribution via GATK. We benchmarked GATK-gCNV in 7,962 exomes from individuals in quartet families with matched genome sequencing and microarray data, finding up to 95% recall of rare coding CNVs at a resolution of more than two exons. We used GATK-gCNV to generate a reference catalog of rare coding CNVs in WES data from 197,306 individuals in the UK Biobank, and observed strong correlations between per-gene CNV rates and measures of mutational constraint, as well as rare CNV associations with multiple traits. In summary, GATK-gCNV is a tunable approach for sensitive and specific CNV discovery in WES data, with broad applications.