Sargolzaei Mehdi, Chesnais Jacques P, Schenkel Flavio S
Centre for Genetic Improvement of Livestock, Animal and Poultry Science Department, University of Guelph, 50 Stone Road East, Guelph, ON, Canada.
BMC Genomics. 2014 Jun 17;15(1):478. doi: 10.1186/1471-2164-15-478.
Genotype imputation can help reduce genotyping costs, particularly for the implementation of genomic selection. In applications involving large populations, recovering the genotypes of untyped loci using information from reference individuals genotyped with a higher-density panel is computationally challenging. Popular imputation methods are based on hidden Markov models and are computationally constrained by an intensive sampling process. A fast, deterministic approach that makes use of both family and population information is presented here. All individuals are related to some degree and therefore share haplotypes, which may differ in length and frequency depending on their relationships. The method starts with family imputation if pedigree information is available, and then exploits close relationships by searching for long haplotype matches in the reference group using overlapping sliding windows. The search continues with a smaller window size in each successive chromosome sweep in order to capture more distant relationships.
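The population step described above (overlapping sliding windows that shrink with each chromosome sweep) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes already-phased haplotypes coded 0/1 with -1 for untyped markers, skips family imputation entirely, and fills a marker only when every matching reference haplotype agrees on the allele. The function name `impute_haplotype` and the parameters `window_sizes` and `min_typed` are hypothetical.

```python
from typing import List, Sequence

MISSING = -1  # assumed code for an untyped marker


def impute_haplotype(
    target: Sequence[int],
    reference: Sequence[Sequence[int]],
    window_sizes: Sequence[int] = (8, 4),
    min_typed: int = 2,
) -> List[int]:
    """Fill untyped markers of one haplotype from matching reference haplotypes.

    Windows are swept from the largest size to the smallest, so long
    matches (close relatives) are used before short ones (distant relatives).
    """
    n = len(target)
    result = list(target)
    for size in window_sizes:                 # one chromosome sweep per size
        step = max(1, size // 2)              # 50% overlap between windows
        for start in range(0, n, step):
            end = min(n, start + size)
            # Markers actually typed on the low-density panel in this window.
            typed = [i for i in range(start, end) if target[i] != MISSING]
            if len(typed) < min_typed:
                continue                      # too little evidence to match
            matches = [
                ref for ref in reference
                if all(ref[i] == target[i] for i in typed)
            ]
            if not matches:
                continue
            for i in range(start, end):
                if result[i] == MISSING:
                    alleles = {ref[i] for ref in matches}
                    if len(alleles) == 1:     # deterministic: fill only if unanimous
                        result[i] = alleles.pop()
    return result


# Toy reference panel of three fully typed haplotypes (illustrative data only).
r1 = [0, 1, 0, 1, 1, 0, 0, 1]
r2 = [1, 1, 0, 0, 1, 0, 1, 1]
r3 = [0, 0, 1, 1, 0, 1, 1, 0]
reference = [r1, r2, r3]

# Target typed only at markers 1, 3, 4 and 6.
close_relative = [MISSING, 1, MISSING, 1, 1, MISSING, 0, MISSING]
recombinant = [MISSING, 1, MISSING, 1, 0, MISSING, 1, MISSING]

print(impute_haplotype(close_relative, reference))
print(impute_haplotype(recombinant, reference))
```

In this toy data the first target matches `r1` across the whole 8-marker window, whereas the second is a mosaic of `r1` and `r3` and is recovered only from the shorter window matches, which mirrors the idea that shrinking windows pick up more distant relationships.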
The proposed method gave imputation accuracy higher than or similar to that of Beagle and Impute2 in cattle data sets when all available information was used. When close relatives of target individuals were present in the reference group, the method was more accurate than the other two methods even when the pedigree was not used. Rare variants were also imputed with higher accuracy. Finally, computing requirements were considerably lower than those of Beagle and Impute2. The presented method took 28 minutes to impute from 6k to 50k genotypes for 2,000 individuals with a reference group of 64,429 individuals.
The proposed method efficiently makes use of information from close and distant relatives for accurate genotype imputation. In addition to its high imputation accuracy, the method is fast, owing to its deterministic nature and, therefore, it can easily be used in large data sets where the use of other methods is impractical.