Department of Molecular Bases of Human Genetics, Institute of Molecular Genetics of Russian Academy of Sciences, 2 Kurchatov sq., Moscow, 123182, Russia.
BMC Bioinformatics. 2020 Jul 24;21(Suppl 12):304. doi: 10.1186/s12859-020-03589-0.
Genotype imputation increases the power of genome-wide association studies, but imputation quality should be assessed in each particular case. Not all imputation software reports the error of its output; for example, the latest release of the fastPHASE program (1.4.8) lacks such an option. With fastPHASE there is also uncertainty in choosing the model parameters. The program is based on haplotype clusters, the number of which must be set a priori, and this parameter influences the results of imputation and of downstream analysis.
We present a software toolkit, imputeqc, to assess imputation quality and/or to choose the model parameters for imputation. We demonstrate the efficacy of the toolkit by evaluating imputations made with both fastPHASE and BEAGLE on HapMap and 1000 Genomes data. The genotype discordance obtained with the two methods correlated well. Using imputeqc, we also show how to choose the optimal number of haplotype clusters and expectation-maximization cycles for fastPHASE. The optimal number of haplotype clusters found, 25, was then applied in hapFLK testing, which revealed signatures of selection at the LCT region on chromosome 2. We also demonstrate how to decrease the computational time of hapFLK testing from 3 days to 20 h.
The toolkit is implemented as an R package, imputeqc, and a set of command-line scripts. The code is freely available at https://github.com/inzilico/imputeqc under the MIT license.
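To make the notion of genotype discordance concrete, the following is a minimal sketch in Python of one common way to estimate it: hide a fraction of known genotypes, impute them with any program, and count how many hidden genotypes come back different. It is an illustration under stated assumptions, not the imputeqc API. The function names `mask_genotypes` and `discordance`, the 0/1/2 genotype coding, and the use of `None` for missing values are all hypothetical choices made for this example.

```python
import random


def mask_genotypes(genotypes, fraction, seed=0):
    """Hide a random fraction of the known genotypes.

    genotypes: list of rows (one per sample), each a list of 0/1/2 codes,
               with None marking a genotype that is already missing
               (hypothetical coding chosen for this sketch).
    Returns (masked_copy, masked_positions), where masked_positions is a
    set of (sample_index, marker_index) pairs that were hidden.
    """
    rng = random.Random(seed)
    # Only genotypes that are actually known can serve as test cases.
    known = [(i, j)
             for i, row in enumerate(genotypes)
             for j, g in enumerate(row)
             if g is not None]
    n_mask = round(len(known) * fraction)
    masked = set(rng.sample(known, n_mask))
    masked_copy = [[None if (i, j) in masked else g
                    for j, g in enumerate(row)]
                   for i, row in enumerate(genotypes)]
    return masked_copy, masked


def discordance(truth, imputed, masked):
    """Proportion of masked genotypes that were imputed incorrectly."""
    if not masked:
        raise ValueError("no masked genotypes to compare")
    wrong = sum(1 for i, j in masked if truth[i][j] != imputed[i][j])
    return wrong / len(masked)
```

In use, the masked matrix would be written out in the input format of the imputation program, imputed with a given parameter setting (for example, a candidate number of haplotype clusters), read back, and passed to `discordance` together with the original matrix. Repeating this over several settings and keeping the one with the lowest discordance is one plausible way to choose parameters.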