Spiliopoulou Athina, Colombo Marco, Orchard Peter, Agakov Felix, McKeigue Paul
Centre for Population Health Sciences, Usher Institute, University of Edinburgh, EH8 9AG, United Kingdom.
Pharmatics Ltd., Edinburgh, EH16 4UX, United Kingdom.
Genetics. 2017 May;206(1):91-104. doi: 10.1534/genetics.117.200063. Epub 2017 Mar 27.
We address the task of genotype imputation to a dense reference panel, taking as input genotype likelihoods computed from ultralow coverage sequencing. In this setting, the data have a high level of missingness or uncertainty, and are thus better suited to a probabilistic representation than to hard genotype calls. Most existing imputation algorithms are not well suited to this situation: they rely on prephasing for computational efficiency, and without definite genotype calls the prephasing task itself becomes computationally expensive. We describe GeneImp, a program for genotype imputation that does not require prephasing and is computationally tractable for whole-genome imputation. GeneImp does not explicitly model recombination. Instead, it capitalizes on the existence of large reference panels, comprising thousands of reference haplotypes, and assumes that over short regions the target haplotypes can be adequately represented by unaltered reference haplotypes. We validate GeneImp using data from ultralow coverage sequencing (0.5×), and compare its performance to the most recent version of BEAGLE that can perform this task. We show that GeneImp achieves imputation quality very close to that of BEAGLE while using one to two orders of magnitude less time, without an increase in memory complexity. Therefore, GeneImp is the first practical choice for whole-genome imputation to a dense reference panel when prephasing cannot be applied, for instance, in datasets produced via ultralow coverage sequencing. A related future application for GeneImp is whole-genome imputation based on the off-target reads from deep whole-exome sequencing.
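The core assumption above (over a short window, a target individual's two haplotypes are unaltered copies of reference haplotypes) can be illustrated with a minimal sketch. This is not GeneImp's implementation: the function name `impute_window`, the uniform prior over haplotype pairs, and the exhaustive enumeration of pairs are all assumptions made for illustration, and the enumeration is quadratic in panel size, so it would not scale to thousands of haplotypes without further filtering.

```python
from itertools import product


def impute_window(ref_haps, gls):
    """Posterior genotype probabilities per site for one short window.

    ref_haps: list of reference haplotypes, each a sequence of 0/1 alleles.
    gls: per-site genotype likelihoods (L(g=0), L(g=1), L(g=2));
         a site with no reads can be encoded as (1.0, 1.0, 1.0).
    Assumes (hypothetically) a uniform prior over ordered haplotype pairs
    and no recombination or mutation inside the window.
    """
    n_sites = len(gls)
    weighted_pairs = []
    for h1, h2 in product(ref_haps, repeat=2):
        # Likelihood of the reads if the target carried exactly (h1, h2).
        w = 1.0
        for s in range(n_sites):
            w *= gls[s][h1[s] + h2[s]]
        weighted_pairs.append((w, h1, h2))

    total = sum(w for w, _, _ in weighted_pairs)
    if total == 0.0:
        raise ValueError("no reference haplotype pair is compatible with the data")

    # Marginalize the pair posterior onto each site's genotype (allele dosage).
    probs = [[0.0, 0.0, 0.0] for _ in range(n_sites)]
    for w, h1, h2 in weighted_pairs:
        for s in range(n_sites):
            probs[s][h1[s] + h2[s]] += w / total
    return probs


# Toy panel of two haplotypes over three sites; only site 0 has read data.
ref_haps = [(0, 0, 1), (1, 1, 0)]
gls = [(0.8, 0.2, 0.0), (1.0, 1.0, 1.0), (1.0, 1.0, 1.0)]
probs = impute_window(ref_haps, gls)
```

In this toy example the evidence at the first site propagates to the two uncovered sites purely through the reference haplotypes, which is the behaviour the no-recombination assumption is meant to exploit.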