

Extending long-range phasing and haplotype library imputation algorithms to large and heterogeneous datasets.

Affiliation

The Roslin Institute and Royal (Dick) School of Veterinary Studies, The University of Edinburgh, Easter Bush, Midlothian, Scotland, UK.

Publication information

Genet Sel Evol. 2020 Jul 8;52(1):38. doi: 10.1186/s12711-020-00558-2.

Abstract

BACKGROUND

We describe the latest improvements to the long-range phasing (LRP) and haplotype library imputation (HLI) algorithms, which enable successful phasing of datasets with one million individuals as well as datasets genotyped with different sets of single nucleotide polymorphisms (SNPs). Previous publicly available implementations of the LRP algorithm in AlphaPhase could not phase large datasets because of the computational cost of defining surrogate parents by exhaustive all-against-all searches. Furthermore, the AlphaPhase implementations of LRP and HLI were not designed to handle the large amounts of missing data that are inherent when multiple SNP arrays are used.

METHODS

We developed methods that avoid the need for all-against-all searches by performing LRP on subsets of individuals and then concatenating the results. We also extended LRP and HLI algorithms to enable the use of different sets of markers, including missing values, when determining surrogate parents and identifying haplotypes. We implemented and tested these extensions in an updated version of AlphaPhase, and compared its performance to the software package Eagle2.
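The two ideas in this paragraph can be illustrated with a minimal Python sketch. It is not the AlphaPhase implementation: the genotype coding (0/1/2 with 9 for missing), the opposing-homozygote criterion used to pick surrogate-parent candidates, the random partitioning scheme, and all function names and thresholds are assumptions made for illustration only.

```python
import numpy as np

MISSING = 9  # assumed missing-genotype code for this sketch


def split_into_subsets(n_individuals, subset_size, seed=0):
    """Randomly partition individual indices into subsets of at most subset_size."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_individuals)
    return [order[i:i + subset_size] for i in range(0, n_individuals, subset_size)]


def opposing_homozygotes(a, b):
    """Return (opposing, shared): the number of loci where a and b are opposite
    homozygotes (0 vs 2), and the number of loci genotyped in both.
    Loci missing in either individual are skipped rather than counted as conflicts."""
    a = np.asarray(a).astype(int)
    b = np.asarray(b).astype(int)
    both_typed = (a != MISSING) & (b != MISSING)
    opposing = both_typed & (np.abs(a - b) == 2)
    return int(opposing.sum()), int(both_typed.sum())


def surrogate_candidates(genotypes, proband, subset, max_opposing=0, min_shared=2):
    """Search only within `subset` (not all individuals) for candidates that share
    enough genotyped loci with the proband and show few opposing homozygotes."""
    candidates = []
    for other in subset:
        if other == proband:
            continue
        opposing, shared = opposing_homozygotes(genotypes[proband], genotypes[other])
        if shared >= min_shared and opposing <= max_opposing:
            candidates.append(int(other))
    return candidates
```

Restricting the search to a subset of size s reduces the number of pairwise comparisons per subset from the order of n² to the order of s², which is the motivation for phasing subsets separately and concatenating the results. Requiring a minimum number of shared genotyped loci (the hypothetical `min_shared` parameter) guards against declaring two individuals compatible merely because they were genotyped on arrays with little marker overlap.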

RESULTS

A simulated dataset with one million individuals genotyped with the same 6711 SNPs for a single chromosome took AlphaPhase less than a day to phase, compared to more than seven days for Eagle2. The percentage of correctly phased alleles at heterozygous loci was 90.2% for AlphaPhase and 99.9% for Eagle2. A larger dataset with one million individuals genotyped with 49,579 SNPs for a single chromosome took AlphaPhase 23 days to phase, with 89.9% of alleles at heterozygous loci phased correctly. The phasing accuracy was generally lower for datasets with different sets of markers than for datasets with one set of markers. For a simulated dataset with three sets of markers, 1.5% of alleles at heterozygous positions were phased incorrectly, compared to 0.4% with one set of markers.
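The accuracy measure reported here, the percentage of correctly phased alleles at heterozygous loci, could be computed along the following lines when the true haplotypes are known, as in a simulation. This is only a sketch: the abstract does not specify how the authors handled haplotype label order or unphased alleles, so comparing both orientations of the haplotype pair and taking the better one is an assumption, as are the function name and the 2 × n_loci array layout.

```python
import numpy as np


def phasing_accuracy(true_haps, phased_haps):
    """Percentage of alleles at truly heterozygous loci that are phased correctly.

    Both inputs are 2 x n_loci arrays of 0/1 alleles for one individual.
    Haplotype order is arbitrary, so both orientations are scored and the
    better one is reported (an assumption of this sketch)."""
    true_haps = np.asarray(true_haps)
    phased_haps = np.asarray(phased_haps)
    het = true_haps[0] != true_haps[1]  # loci where the true genotype is heterozygous
    if not het.any():
        return float("nan")  # accuracy is undefined without heterozygous loci
    direct = (phased_haps[:, het] == true_haps[:, het]).sum()
    swapped = (phased_haps[::-1][:, het] == true_haps[:, het]).sum()
    return 100.0 * max(direct, swapped) / (2 * het.sum())
```

Homozygous loci are excluded because their phase is trivially known, so including them would inflate the percentage.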

CONCLUSIONS

The improved LRP and HLI algorithms enable AlphaPhase to quickly and accurately phase very large and heterogeneous datasets. AlphaPhase is an order of magnitude faster than the other tested packages, although Eagle2 showed a higher level of phasing accuracy. The speed gain will make phasing achievable for very large genomic datasets in livestock, enabling more powerful breeding and genetics research and application.


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/74e9/7346379/51aaba982cb6/12711_2020_558_Fig1_HTML.jpg
