Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan 48109.
Genetics. 2013 Oct;195(2):319-30. doi: 10.1534/genetics.113.154591. Epub 2013 Aug 9.
The recent dramatic cost reduction of next-generation sequencing technology enables investigators to assess most variants in the human genome and to identify risk variants for complex diseases. However, sequencing large samples remains very expensive. For a study sample with existing genotype data, such as array data from genome-wide association studies, a cost-effective approach is to sequence a subset of the study sample and then to impute the rest of the study sample, using the sequenced subset as a reference panel. The use of such an internal reference panel identifies population-specific variants and avoids the problem of a substantial mismatch in ancestry background between the study population and the reference population. To efficiently select an internal panel, we borrow the concept of phylogenetic diversity from mathematical phylogenetics and comparative genomics. We propose the "most diverse reference panel", defined as the subset with the maximal "phylogenetic diversity", thereby incorporating individuals that span a diverse range of genotypes within the sample. Using data both from simulations and from the 1000 Genomes Project, we show that the most diverse reference panel can substantially improve imputation accuracy compared to randomly selected reference panels, especially for the imputation of rare variants. The improvement in imputation accuracy holds across different marker densities, reference panel sizes, and lengths of the imputed segments. We thus propose a novel strategy for planning sequencing studies on samples with existing genotype data.
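To make the selection idea concrete, the following is a minimal sketch of choosing a diverse subset from existing array genotypes. It is not the authors' procedure: the abstract defines the panel by maximal phylogenetic diversity, whereas this sketch swaps in a simpler proxy, greedy farthest-point selection on pairwise genotype distance. The function names (`genotype_distance`, `select_diverse_panel`), the 0/1/2 allele-dosage coding, and the toy data are all assumptions made for illustration.

```python
from typing import List, Sequence


def genotype_distance(a: Sequence[int], b: Sequence[int]) -> int:
    """Sum of absolute allele-dosage differences (assumed 0/1/2 coding)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def select_diverse_panel(
    genotypes: Sequence[Sequence[int]], panel_size: int
) -> List[int]:
    """Greedy farthest-point selection (a proxy, not phylogenetic diversity).

    Returns the indices of the chosen individuals in selection order.
    """
    n = len(genotypes)
    if not 0 < panel_size <= n:
        raise ValueError("panel_size must be between 1 and the sample size")

    # Pairwise distance matrix over all individuals.
    dist = [
        [genotype_distance(genotypes[i], genotypes[j]) for j in range(n)]
        for i in range(n)
    ]

    # Seed with the individual that is farthest from everyone else in total.
    first = max(range(n), key=lambda i: sum(dist[i]))
    panel = [first]

    # min_dist[i] is the distance from individual i to its nearest panel member.
    min_dist = list(dist[first])

    while len(panel) < panel_size:
        candidates = [i for i in range(n) if i not in panel]
        # Add the individual least well represented by the current panel.
        nxt = max(candidates, key=lambda i: min_dist[i])
        panel.append(nxt)
        min_dist = [min(min_dist[i], dist[nxt][i]) for i in range(n)]

    return panel


# Hypothetical toy sample: 5 individuals typed at 4 markers.
example = [
    [0, 0, 0, 0],
    [0, 0, 0, 1],
    [2, 2, 2, 2],
    [2, 2, 1, 2],
    [0, 2, 0, 2],
]
print(select_diverse_panel(example, 3))  # prints [0, 2, 4]
```

In the toy sample, individuals 0 and 1 are near-duplicates, as are 2 and 3, so the sketch picks one from each cluster plus the intermediate individual 4. A random draw of three could easily select both members of one cluster and leave the other unrepresented, which is the intuition behind preferring a diverse panel.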