Faux Pierre, Druet Tom
Unit of Animal Genomics, GIGA-R and Faculty of Veterinary Medicine, University of Liège, 4000, Liège, Belgium.
Genet Sel Evol. 2017 May 16;49(1):46. doi: 10.1186/s12711-017-0321-6.
Haplotype reconstruction (phasing) is an essential step in many applications, including imputation and genomic selection. The best phasing methods rely on both familial and linkage disequilibrium (LD) information. With whole-genome sequence (WGS) data, relatively small samples of reference individuals are generally sequenced due to prohibitive sequencing costs, thus only a limited amount of familial information is available. However, reference individuals have many relatives that have been genotyped (at lower density). The goal of our study was to improve phasing of WGS data by integrating familial information from haplotypes that were obtained from a larger genotyped dataset and to quantify its impact on imputation accuracy.
Aligning a pre-phased WGS panel [5 million single nucleotide polymorphisms (SNPs)], phased with LD information only, to a 50k SNP array phased with both LD and familial information (the scaffold) assigned the correct parental origin to 99.62% of the WGS SNPs whose phase could be determined unambiguously from parental genotypes. Without the 50k haplotypes as scaffold, that value dropped to 50%, as expected. After alignment to the genotype phase, correctly phased segments were on average longer and the number of switches decreased slightly. Most of the incorrectly assigned segments, and the switches that followed them, were due to singleton errors. Imputation from the 50k SNP array to WGS data with improved phasing had a marginal impact on imputation accuracy (measured as r²): on average, 90.47% with traditional techniques versus 90.65% with pre-phasing that integrated familial information. Differences were larger for SNPs located at chromosome ends and for rare variants. With a denser WGS panel (13 million SNPs) obtained with traditional variant filtering rules, we found similar results, although both phasing and imputation accuracy were lower.
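The two kinds of measure reported above, imputation accuracy as a squared correlation and the number of phase switches, can be illustrated with a minimal Python sketch. It assumes genotypes coded as 0/1/2 allele counts and a per-site parental-origin label for one inferred haplotype; the function names `imputation_r2` and `count_switches` are hypothetical and are not taken from the study's software.

```python
def imputation_r2(true_genotypes, imputed_dosages):
    """Squared Pearson correlation between true genotypes (0/1/2)
    and imputed allele dosages for a single SNP."""
    n = len(true_genotypes)
    if n != len(imputed_dosages) or n < 2:
        raise ValueError("need two equally long vectors with at least 2 values")
    mean_t = sum(true_genotypes) / n
    mean_d = sum(imputed_dosages) / n
    cov = sum((t - mean_t) * (d - mean_d)
              for t, d in zip(true_genotypes, imputed_dosages))
    var_t = sum((t - mean_t) ** 2 for t in true_genotypes)
    var_d = sum((d - mean_d) ** 2 for d in imputed_dosages)
    if var_t == 0 or var_d == 0:
        return float("nan")  # correlation is undefined for a monomorphic SNP
    return cov * cov / (var_t * var_d)


def count_switches(origins):
    """Number of changes in parental origin (0 = paternal, 1 = maternal)
    along one inferred haplotype, over consecutive informative sites."""
    return sum(1 for a, b in zip(origins, origins[1:]) if a != b)


# A single wrongly assigned site (a singleton error) produces two switches.
singleton_error = [0, 0, 1, 0, 0]
switches_from_singleton = count_switches(singleton_error)
```

The last two lines show why singleton errors weigh on switch counts: one misassigned site inside an otherwise correct segment is counted as a switch in and a switch back.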
We present a phasing strategy for WGS data that integrates familial information indirectly: WGS haplotypes pre-phased with LD information only are aligned to haplotypes that were obtained from genotyping data, using both LD and familial information, in a much larger population. This strategy results in very few mismatches with the phase obtained by Mendelian segregation rules. Finally, we propose a strategy to further improve phasing accuracy based on haplotype clusters obtained with genotyping data.
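The alignment idea can be sketched in Python as follows. This is an illustrative heuristic only, not the authors' implementation: it assumes that the scaffold markers are also present in the WGS panel, that positions are sorted, and that each WGS SNP simply takes the orientation of the nearest heterozygous scaffold marker. The function name `align_to_scaffold` and all argument names are hypothetical.

```python
from bisect import bisect_left


def align_to_scaffold(wgs_pos, wgs_h0, wgs_h1, scaf_pos, scaf_pat, scaf_mat):
    """Assign parental origin to LD-only phased WGS haplotypes (wgs_h0, wgs_h1)
    using scaffold haplotypes with known paternal/maternal origin.
    Returns (paternal, maternal) allele lists over wgs_pos."""
    index = {p: i for i, p in enumerate(wgs_pos)}
    anchor_pos, anchor_flip = [], []
    for p, a_pat, a_mat in zip(scaf_pos, scaf_pat, scaf_mat):
        i = index.get(p)
        if i is None or a_pat == a_mat:
            continue  # marker absent from WGS panel, or homozygous: uninformative
        pair = (wgs_h0[i], wgs_h1[i])
        if pair == (a_pat, a_mat):
            anchor_pos.append(p)
            anchor_flip.append(False)  # h0 already carries the paternal allele
        elif pair == (a_mat, a_pat):
            anchor_pos.append(p)
            anchor_flip.append(True)   # h0 carries the maternal allele: swap
        # genotype disagreements between the two datasets are skipped
    if not anchor_pos:
        raise ValueError("no informative scaffold marker found")

    paternal, maternal = [], []
    for p, a0, a1 in zip(wgs_pos, wgs_h0, wgs_h1):
        k = bisect_left(anchor_pos, p)
        if k == len(anchor_pos):
            k -= 1  # beyond the last anchor: use the last one
        elif k > 0 and p - anchor_pos[k - 1] <= anchor_pos[k] - p:
            k -= 1  # the anchor on the left is at least as close
        flip = anchor_flip[k]
        paternal.append(a1 if flip else a0)
        maternal.append(a0 if flip else a1)
    return paternal, maternal


# Toy example: the LD-only phase contains one switch after position 30.
positions = [10, 20, 30, 40, 50, 60]
h0 = [0, 1, 0, 0, 0, 1]
h1 = [1, 0, 1, 1, 1, 0]
pat, mat = align_to_scaffold(positions, h0, h1, [20, 50], [1, 1], [0, 0])
```

In the toy example, the two scaffold markers disagree on orientation, so the segment to the right of the switch is swapped and the paternal and maternal haplotypes are recovered. A nearest-anchor rule is the simplest possible choice; it places the switch at the midpoint between two discordant anchors, which real data would rarely justify.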