Research & Development, SelfDecode, Miami, FL, United States of America.
PLoS One. 2022 Oct 19;17(10):e0260177. doi: 10.1371/journal.pone.0260177. eCollection 2022.
Whole-genome data has become significantly more accessible over the last two decades. This can largely be attributed to reduced sequencing costs and to imputation models, which make it possible to obtain nearly whole-genome data from less expensive genotyping methods such as microarray chips. Although there are many different approaches to imputation, the Hidden Markov Model (HMM) remains the most widely used. In this study, we compared the latest versions of the most popular HMM-based tools for phasing and imputation: Beagle5.4, Eagle2.4.1, Shapeit4, Impute5 and Minimac4. We benchmarked them on four input datasets with three levels of chip density. We assessed each imputation software on the basis of accuracy, speed and memory usage, and showed how the choice of imputation accuracy metric can result in different interpretations. The highest average concordance rate was achieved by Beagle5.4, followed by Impute5 and Minimac4, using a reference-based approach during phasing and the highest density chip. The imputation quality score (IQS) and R2 metrics revealed that Impute5 and Minimac4 obtained better results for low-frequency markers, while Beagle5.4 remained more accurate for common markers (minor allele frequency, MAF > 5%). Computational load as measured by run time was lower for Beagle5.4 than for Minimac4 and Impute5, while Minimac4 used the least memory of the imputation tools we compared. Shapeit4 used the least memory of the phasing tools examined with genotype chip data, while Eagle2.4.1 used the least memory when phasing whole-genome sequencing (WGS) data. Finally, we determined the combination of phasing software, imputation software and reference panel best suited for different situations and analysis needs, and created an automated pipeline that lets users create customized chips designed to optimize their imputation results.
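To illustrate why the choice of accuracy metric can change the interpretation, the following minimal Python sketch computes the three metrics named above for a single marker. It assumes the commonly used definitions: concordance as the fraction of matching genotype calls, IQS as chance-corrected agreement (a kappa-style statistic), and R2 as the squared Pearson correlation between true genotypes and imputed dosages. The function names and the toy genotype vectors are hypothetical and are not taken from the study's pipeline.

```python
import math
from collections import Counter


def _check(true_gt, imputed):
    # Both vectors must describe the same, non-empty set of samples.
    if len(true_gt) == 0 or len(true_gt) != len(imputed):
        raise ValueError("inputs must be non-empty and of equal length")


def concordance(true_gt, imputed_gt):
    """Fraction of samples whose imputed genotype (0/1/2) equals the true one."""
    _check(true_gt, imputed_gt)
    matches = sum(t == i for t, i in zip(true_gt, imputed_gt))
    return matches / len(true_gt)


def iqs(true_gt, imputed_gt):
    """Chance-corrected agreement: (P_obs - P_chance) / (1 - P_chance)."""
    _check(true_gt, imputed_gt)
    n = len(true_gt)
    p_obs = concordance(true_gt, imputed_gt)
    true_counts = Counter(true_gt)
    imp_counts = Counter(imputed_gt)
    # Agreement expected by chance, from the marginal genotype counts.
    p_chance = sum(true_counts[g] * imp_counts[g] for g in (0, 1, 2)) / n**2
    if p_chance == 1.0:
        return math.nan  # undefined when both vectors are one constant class
    return (p_obs - p_chance) / (1.0 - p_chance)


def r_squared(true_gt, imputed_dosage):
    """Squared Pearson correlation between true genotypes and imputed dosages."""
    _check(true_gt, imputed_dosage)
    n = len(true_gt)
    mean_t = sum(true_gt) / n
    mean_d = sum(imputed_dosage) / n
    cov = sum((t - mean_t) * (d - mean_d) for t, d in zip(true_gt, imputed_dosage))
    var_t = sum((t - mean_t) ** 2 for t in true_gt)
    var_d = sum((d - mean_d) ** 2 for d in imputed_dosage)
    if var_t == 0 or var_d == 0:
        return math.nan  # correlation is undefined for a constant vector
    return cov**2 / (var_t * var_d)


# Toy rare marker (invented data): 2 heterozygous carriers among 100 samples.
true_gt = [0] * 98 + [1] * 2
all_reference = [0] * 100          # imputation that misses every carrier
one_carrier = [0] * 98 + [1, 0]    # imputation that recovers one carrier

print(concordance(true_gt, all_reference), iqs(true_gt, all_reference))
print(concordance(true_gt, one_carrier), iqs(true_gt, one_carrier),
      r_squared(true_gt, one_carrier))
```

In this toy case, an imputation that calls every sample homozygous reference still reaches a concordance of 0.98 while its IQS is 0, which is one way concordance can look favorable for low-frequency markers even when no carriers are recovered.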