Martin Alicia R, Tse Gerard, Bustamante Carlos D, Kenny Eimear E
Department of Genetics & Biomedical Informatics Training Program, Stanford University, Stanford, CA, 94305, USA.
Pac Symp Biocomput. 2014:241-52.
A striking finding from recent large-scale sequencing efforts is that the vast majority of variants in the human genome are rare and found within single populations or lineages. These observations hold important implications for the design of the next round of disease variant discovery efforts: if genetic variants that influence disease risk follow the same trend, then we expect to see population-specific disease associations that require large sample sizes for detection. To address this challenge, and because sequencing large cohorts remains prohibitively expensive, researchers have developed a new generation of low-cost genotyping arrays that assay rare variation previously identified in large exome sequencing studies. Genotyping approaches rely not only on directly observing variants, but also on phasing and imputation methods that use publicly available reference panels to infer unobserved variants in a study cohort. Rare variant exome arrays are intentionally enriched for variants likely to be disease causing, and here we assess the ability of the first commercially available rare exome variant array (the Illumina Infinium HumanExome BeadChip) to also tag other potentially damaging variants that are not molecularly assayed. Using full sequence data for chromosome 22 from the phase I 1000 Genomes Project, we evaluate three imputation methods (BEAGLE, MaCH-Admix, and SHAPEIT2/IMPUTE2) with the rare exome variant array under varied study panel sizes, reference panel sizes, and LD structures arising from population differences. We find that imputation is more accurate across both the genome and the exome for common variant arrays than for the next-generation array at all allele frequencies, including rare alleles. We also find that imputation is least accurate in African populations, and that accuracy for rare variants improves substantially when the same population is included in the reference panel.
Depending on the goals of GWAS researchers, our results will aid budget decisions by helping determine whether money is best spent sequencing the genomes of smaller sample sizes, genotyping larger sample sizes with rare and/or common variant arrays and imputing SNPs, or some combination of the two.
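The abstract does not say how imputation accuracy was scored. As an illustration only, the sketch below shows one widely used way to make such a comparison: the squared Pearson correlation (r²) between true allele counts and imputed dosages at each masked variant, averaged within minor allele frequency bins. The function names and the bin edges (0.5% and 5%) are assumptions chosen for this example and are not taken from the study.

```python
def dosage_r2(true_genotypes, imputed_dosages):
    """Squared Pearson correlation between true allele counts (0/1/2)
    and imputed dosages (0.0 to 2.0) for a single variant.
    Returns None when either vector has no variance (r2 is undefined)."""
    n = len(true_genotypes)
    if n == 0 or n != len(imputed_dosages):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_t = sum(true_genotypes) / n
    mean_d = sum(imputed_dosages) / n
    cov = sum((t - mean_t) * (d - mean_d)
              for t, d in zip(true_genotypes, imputed_dosages))
    var_t = sum((t - mean_t) ** 2 for t in true_genotypes)
    var_d = sum((d - mean_d) ** 2 for d in imputed_dosages)
    if var_t == 0 or var_d == 0:
        return None
    return cov * cov / (var_t * var_d)


def minor_allele_frequency(genotypes):
    """Minor allele frequency from diploid allele counts (0/1/2)."""
    p = sum(genotypes) / (2 * len(genotypes))
    return min(p, 1 - p)


def mean_r2_by_maf_bin(variants, bin_edges=(0.0, 0.005, 0.05, 0.5)):
    """Average r2 per MAF bin.

    variants: iterable of (true_genotypes, imputed_dosages) pairs.
    bin_edges: hypothetical cut points; each bin is (low, high].
    Returns a dict mapping (low, high) to the mean r2 in that bin.
    """
    per_bin = {}
    for true_g, dosages in variants:
        r2 = dosage_r2(true_g, dosages)
        if r2 is None:
            continue  # skip variants where r2 is undefined
        maf = minor_allele_frequency(true_g)
        for low, high in zip(bin_edges, bin_edges[1:]):
            if low < maf <= high:
                per_bin.setdefault((low, high), []).append(r2)
                break
    return {b: sum(v) / len(v) for b, v in per_bin.items()}
```

Computing this summary separately for each array, imputation method, and reference panel would make accuracy at rare and common variants directly comparable across conditions.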