Astling David P, Heft Ilea E, Jones Kenneth L, Sikela James M
Department of Biochemistry and Molecular Genetics, University of Colorado School of Medicine, Aurora, CO, USA.
Department of Pediatrics, University of Colorado School of Medicine, Aurora, CO, USA.
BMC Genomics. 2017 Aug 14;18(1):614. doi: 10.1186/s12864-017-3976-z.
DUF1220 protein domains, found primarily in Neuroblastoma BreakPoint Family (NBPF) genes, show the greatest human lineage-specific increase in copy number of any coding region in the genome. There are 302 haploid copies of DUF1220 in hg38 (~160 of which are human-specific), and the majority of these can be divided into six subtypes (referred to as clades). Copy number changes of specific DUF1220 clades have been associated in a dose-dependent manner with brain size variation (both evolutionarily and within the human population), cognitive aptitude, autism severity, and schizophrenia severity. However, no published method can directly measure DUF1220 copy number with high accuracy, and none can distinguish between domains within a clade.
Here we describe a novel method for measuring, from whole genome sequence data, the copy number of DUF1220 domains and of the NBPF genes in which they are found. We have characterized the effect that various sequencing and alignment parameters and strategies have on the accuracy and precision of the method, and defined the parameters that lead to optimal DUF1220 copy number measurement and resolution. We show that copy number estimates obtained using our read depth approach are highly correlated with those generated by droplet digital PCR (ddPCR) for three representative DUF1220 clades. By simulation, we demonstrate that our method provides sufficient resolution to analyze DUF1220 copy number variation at three levels: (1) DUF1220 clade copy number within individual genes and groups of genes (gene-specific clade groups), (2) genome-wide DUF1220 clade copies, and (3) gene copy number for DUF1220-encoding genes.
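The general idea behind a read depth approach can be illustrated with a minimal sketch. It is not the authors' pipeline. It assumes that a mean read depth has already been computed for each annotated domain and that a single-copy baseline depth is available for the same sample. The function name, the input layout, and the clade labels (`clade_A`, `clade_B`) are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def clade_copy_numbers(domain_depths, baseline_depth, ploidy=2):
    """Sum per-domain copy number estimates into per-clade totals.

    domain_depths  -- iterable of (clade_label, mean_depth) pairs, one per
                      annotated domain (hypothetical input layout)
    baseline_depth -- mean depth over regions assumed to be single copy per
                      haploid genome in this sample (an assumption)
    ploidy         -- copies expected at a normal diploid locus
    """
    if baseline_depth <= 0:
        raise ValueError("baseline_depth must be positive")

    totals = defaultdict(float)
    for clade, depth in domain_depths:
        # A domain covered at the baseline depth counts as `ploidy` copies;
        # higher or lower depth scales the estimate proportionally.
        totals[clade] += ploidy * depth / baseline_depth
    return dict(totals)


# Illustrative values only, not data from the study.
example = [("clade_A", 30.0), ("clade_A", 45.0), ("clade_B", 15.0)]
estimates = clade_copy_numbers(example, baseline_depth=30.0)
```

Summing within a clade reflects the genome-wide clade level of analysis described above. Grouping by gene and clade instead would correspond to the gene-specific level, provided reads can be assigned to individual domains reliably.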
To our knowledge, this is the first method to accurately measure copies of all six DUF1220 clades and the first to provide gene-specific resolution of these clades. This allows one to discriminate among the ~300 haploid human DUF1220 copies to an extent not possible with any other method. The result is a greatly enhanced capability to analyze the role that these sequences play in human variation and disease.