Phasing of many thousands of genotyped samples.

Affiliations

Department of Genetics, Harvard Medical School, Boston, MA 02115, USA; Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.

Publication information

Am J Hum Genet. 2012 Aug 10;91(2):238-51. doi: 10.1016/j.ajhg.2012.06.013.

DOI:10.1016/j.ajhg.2012.06.013

PMID:22883141

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3415548/

Abstract

Haplotypes are an important resource for a large number of applications in human genetics, but computationally inferred haplotypes are subject to switch errors that decrease their utility. The accuracy of computationally inferred haplotypes increases with sample size, and although ever larger genotypic data sets are being generated, the fact that existing methods require substantial computational resources limits their applicability to data sets containing tens or hundreds of thousands of samples. Here, we present HAPI-UR (haplotype inference for unrelated samples), an algorithm that is designed to handle unrelated and/or trio and duo family data, that has accuracy comparable to or greater than existing methods, and that is computationally efficient and can be applied to 100,000 samples or more. We use HAPI-UR to phase a data set with 58,207 samples and show that it achieves practical runtime and that switch errors decrease with sample size even with the use of samples from multiple ethnicities. Using a data set with 16,353 samples, we compare HAPI-UR to Beagle, MaCH, IMPUTE2, and SHAPEIT and show that HAPI-UR runs 18× faster than all methods and has a lower switch-error rate than do other methods except for Beagle; with the use of consensus phasing, running HAPI-UR three times gives a slightly lower switch-error rate than Beagle does and is more than six times faster. We demonstrate results similar to those from Beagle on another data set with a higher marker density. Lastly, we show that HAPI-UR has better runtime scaling properties than does Beagle so that for larger data sets, HAPI-UR will be practical and will have an even larger runtime advantage. HAPI-UR is available online (see Web Resources).
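The two quantities the abstract relies on, the switch-error rate and consensus phasing over several runs, can be illustrated with a small sketch. This is not HAPI-UR's implementation: the abstract does not say how the consensus is formed, so the majority vote on the relative phase of adjacent heterozygous sites used here is an assumption, as are the function names `switch_error_rate` and `consensus_phase`, the 0/1 allele coding, and the requirement that all runs agree with the truth on genotypes.

```python
from typing import List, Sequence, Tuple

# A phased individual: two haplotypes of equal length, alleles coded 0/1 (assumed coding).
Haplotypes = Tuple[Sequence[int], Sequence[int]]


def switch_error_rate(truth: Haplotypes, inferred: Haplotypes) -> float:
    """Fraction of adjacent heterozygous-site pairs whose relative phase differs from truth.

    Assumes `inferred` has the same genotypes as `truth` at every site.
    """
    t0, t1 = truth
    i0, _ = inferred
    # Only heterozygous sites carry phase information.
    het = [k for k in range(len(t0)) if t0[k] != t1[k]]
    if len(het) < 2:
        return 0.0
    # True where inferred haplotype 0 matches true haplotype 0 at a heterozygous site.
    orient = [i0[k] == t0[k] for k in het]
    # A switch error is a change of orientation between consecutive heterozygous sites.
    switches = sum(1 for a, b in zip(orient, orient[1:]) if a != b)
    return switches / (len(het) - 1)


def consensus_phase(runs: List[Haplotypes]) -> Tuple[List[int], List[int]]:
    """Majority-vote consensus over several phasings of the same individual.

    Hypothetical scheme: for each pair of adjacent heterozygous sites, the runs vote
    on whether haplotype 0 carries the same allele at both sites. Use an odd number
    of runs to avoid ties (a tie is resolved as "different" here).
    """
    if not runs:
        raise ValueError("at least one phasing run is required")
    first0, first1 = runs[0]
    het = [k for k in range(len(first0)) if first0[k] != first1[k]]
    out0, out1 = list(first0), list(first1)
    for prev, cur in zip(het, het[1:]):
        # The vote is unaffected by which haplotype a run happens to label 0 or 1.
        votes_same = sum(1 for h0, _ in runs if h0[prev] == h0[cur])
        same = 2 * votes_same > len(runs)
        out0[cur] = out0[prev] if same else 1 - out0[prev]
        out1[cur] = 1 - out0[cur]
    return out0, out1
```

With three runs whose switch errors fall at different positions, the vote at each adjacent pair is carried by the two runs that are correct there, which is the intuition for why a three-run consensus can reduce the switch-error rate at roughly three times the runtime of a single run.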

Similar articles

A comparison of different algorithms for phasing haplotypes using Holstein cattle genotypes and pedigree data.

J Dairy Sci. 2017 Apr;100(4):2837-2849. doi: 10.3168/jds.2016-11590. Epub 2017 Feb 1.

Phasing for medical sequencing using rare variants and large haplotype reference panels.

Bioinformatics. 2016 Jul 1;32(13):1974-80. doi: 10.1093/bioinformatics/btw065. Epub 2016 Feb 27.

Phasing quality assessment in a brown layer population through family- and population-based software.

BMC Genet. 2019 Jul 17;20(1):57. doi: 10.1186/s12863-019-0759-3.

Assessment of the performance of hidden Markov models for imputation in animal breeding.

Genet Sel Evol. 2018 Sep 17;50(1):44. doi: 10.1186/s12711-018-0416-8.

Fast two-stage phasing of large-scale sequence data.

Am J Hum Genet. 2021 Oct 7;108(10):1880-1890. doi: 10.1016/j.ajhg.2021.08.005. Epub 2021 Sep 2.

Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering.

Am J Hum Genet. 2007 Nov;81(5):1084-97. doi: 10.1086/521987. Epub 2007 Sep 21.

Rapid haplotype inference for nuclear families.

Genome Biol. 2010;11(10):R108. doi: 10.1186/gb-2010-11-10-r108. Epub 2010 Oct 29.

Benchmarking phasing software with a whole-genome sequenced cattle pedigree.

BMC Genomics. 2022 Feb 15;23(1):130. doi: 10.1186/s12864-022-08354-6.

Inference of Chromosome-Length Haplotypes Using Genomic Data of Three or a Few More Single Gametes.

Mol Biol Evol. 2020 Dec 16;37(12):3684-3698. doi: 10.1093/molbev/msaa176.

Cited by

Induced and natural variation affect traits independently in hybrid Populus.

G3 (Bethesda). 2024 Nov 6;14(11). doi: 10.1093/g3journal/jkae218.

Cancer risks among first-degree relatives of women with a genetic predisposition to breast cancer.

J Natl Cancer Inst. 2024 Jun 7;116(6):911-919. doi: 10.1093/jnci/djae030.

Prediction of breast cancer risk for sisters of women attending screening.

J Natl Cancer Inst. 2023 Nov 8;115(11):1310-1317. doi: 10.1093/jnci/djad101.

The genomic analysis of current-day North African populations reveals the existence of trans-Saharan migrations with different origins and dates.

Hum Genet. 2023 Feb;142(2):305-320. doi: 10.1007/s00439-022-02503-3. Epub 2022 Nov 28.

Accurate genome-wide phasing from IBD data.

BMC Bioinformatics. 2022 Nov 23;23(1):502. doi: 10.1186/s12859-022-05066-2.

A comparative analysis of current phasing and imputation software.

PLoS One. 2022 Oct 19;17(10):e0260177. doi: 10.1371/journal.pone.0260177. eCollection 2022.

Reconstructing the history of founder events using genome-wide patterns of allele sharing across individuals.

PLoS Genet. 2022 Jun 23;18(6):e1010243. doi: 10.1371/journal.pgen.1010243. eCollection 2022 Jun.

Genotype error biases trio-based estimates of haplotype phase accuracy.

Am J Hum Genet. 2022 Jun 2;109(6):1016-1025. doi: 10.1016/j.ajhg.2022.04.019.

Impact of polygenic risk for coronary artery disease and cardiovascular medication burden on cognitive impairment in psychotic disorders.

Prog Neuropsychopharmacol Biol Psychiatry. 2022 Mar 8;113:110464. doi: 10.1016/j.pnpbp.2021.110464. Epub 2021 Oct 29.

Fast two-stage phasing of large-scale sequence data.

Am J Hum Genet. 2021 Oct 7;108(10):1880-1890. doi: 10.1016/j.ajhg.2021.08.005. Epub 2021 Sep 2.

References

Genotype imputation with thousands of genomes.

G3 (Bethesda). 2011 Nov;1(6):457-70. doi: 10.1534/g3.111.001198. Epub 2011 Nov 1.

A linear complexity phasing method for thousands of genomes.

Nat Methods. 2011 Dec 4;9(2):179-81. doi: 10.1038/nmeth.1785.

Genome-wide copy number variation study associates metabotropic glutamate receptor gene networks with attention deficit hyperactivity disorder.

Nat Genet. 2011 Dec 4;44(1):78-84. doi: 10.1038/ng.1013.

Haplotype phasing: existing methods and new developments.

Nat Rev Genet. 2011 Sep 16;12(10):703-14. doi: 10.1038/nrg3054.

Imputation of low-frequency variants using the HapMap3 benefits from large, diverse reference sets.

Eur J Hum Genet. 2011 Jun;19(6):662-6. doi: 10.1038/ejhg.2011.10. Epub 2011 Mar 2.

Haplotype-resolved genome sequencing of a Gujarati Indian individual.

Nat Biotechnol. 2011 Jan;29(1):59-63. doi: 10.1038/nbt.1740. Epub 2010 Dec 19.

MaCH: using sequence and genotype data to estimate haplotypes and unobserved genotypes.

Genet Epidemiol. 2010 Dec;34(8):816-34. doi: 10.1002/gepi.20533.

Integrating common and rare genetic variation in diverse human populations.

Nature. 2010 Sep 2;467(7311):52-8. doi: 10.1038/nature09298.

Genotype imputation for genome-wide association studies.

Nat Rev Genet. 2010 Jul;11(7):499-511. doi: 10.1038/nrg2796.

Genome-wide association study of ulcerative colitis identifies three new susceptibility loci, including the HNF4A region.

Nat Genet. 2009 Dec;41(12):1330-4. doi: 10.1038/ng.483. Epub 2009 Nov 15.
