Accuracy and computational efficiency of a graphical modeling approach to linkage disequilibrium estimation.

Authors

Abel Haley J, Thomas Alun

Affiliation

University of Utah, Utah, USA.

Publication information

Stat Appl Genet Mol Biol. 2011;10(1):Article 5. doi: 10.2202/1544-6115.1615. Epub 2011 Jan 6.

DOI:10.2202/1544-6115.1615

PMID:21291415

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC3045084/

Abstract

We develop recent work on using graphical models for linkage disequilibrium to provide efficient programs for model fitting, phasing, and imputation of missing data in large data sets. Two important features contribute to the computational efficiency: the separation of the model fitting and phasing-imputation processes into different programs, and holding in memory only the data within a moving window of loci during model fitting. Optimal parameter values were chosen by cross-validation to maximize the probability of correctly imputing masked genotypes. The best accuracy obtained is slightly below that from the Beagle program of Browning and Browning, and our fitting program is slower. However, for large data sets, it uses less storage. For a reference set of n individuals genotyped at m markers, the time and storage required for fitting a graphical model are approximately O(nm) and O(n+m), respectively. To impute the phases and missing data on n individuals using an already fitted graphical model requires O(nm) time and O(m) storage. While the times for fitting and imputation are both O(nm), the imputation process is considerably faster; thus, once a model is estimated from a reference data set, the marginal cost of phasing and imputing further samples is very low.
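
The masking-based accuracy criterion mentioned in the abstract can be illustrated with a minimal sketch. It assumes genotypes coded as 0, 1, or 2 and uses a per-marker most-common-genotype imputer purely as a placeholder; it is not the graphical-model imputation described here, and all function names are hypothetical.

```python
import random
from collections import Counter


def mask_genotypes(genotypes, n_masked, rng):
    """Copy the genotype matrix and hide n_masked randomly chosen entries (set to None)."""
    masked = [row[:] for row in genotypes]
    cells = [(i, j) for i, row in enumerate(genotypes) for j in range(len(row))]
    hidden = rng.sample(cells, n_masked)
    for i, j in hidden:
        masked[i][j] = None
    return masked, hidden


def impute_by_mode(masked):
    """Placeholder imputer (an assumption, not the paper's method):
    fill each missing entry with the most common observed genotype at that marker."""
    filled = [row[:] for row in masked]
    for j in range(len(masked[0])):
        observed = [row[j] for row in masked if row[j] is not None]
        mode = Counter(observed).most_common(1)[0][0] if observed else 0
        for row in filled:
            if row[j] is None:
                row[j] = mode
    return filled


def masked_accuracy(truth, imputed, hidden):
    """Proportion of hidden entries whose imputed value matches the true genotype."""
    correct = sum(truth[i][j] == imputed[i][j] for i, j in hidden)
    return correct / len(hidden)


# Toy data: 10 individuals, 4 markers, every individual identical.
truth = [[0, 1, 2, 1] for _ in range(10)]
rng = random.Random(0)
masked, hidden = mask_genotypes(truth, 8, rng)
imputed = impute_by_mode(masked)
accuracy = masked_accuracy(truth, imputed, hidden)
print(f"masked-genotype accuracy: {accuracy:.2f}")
```

In a cross-validation setting, a score of this kind would be computed for each candidate parameter value and the value with the highest score retained.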

Similar articles

Imputation of missing single nucleotide polymorphism genotypes using a multivariate mixed model framework.

J Anim Sci. 2011 Jul;89(7):2042-9. doi: 10.2527/jas.2010-3297. Epub 2011 Feb 25.

Towards linkage analysis with markers in linkage disequilibrium by graphical modelling.

Hum Hered. 2007;64(1):16-26. doi: 10.1159/000101419. Epub 2007 Apr 27.

A comprehensive evaluation of SNP genotype imputation.

Hum Genet. 2009 Mar;125(2):163-71. doi: 10.1007/s00439-008-0606-5. Epub 2008 Dec 17.

Phasing quality assessment in a brown layer population through family- and population-based software.

BMC Genet. 2019 Jul 17;20(1):57. doi: 10.1186/s12863-019-0759-3.

Accuracy of genotype imputation in sheep breeds.

Anim Genet. 2012 Feb;43(1):72-80. doi: 10.1111/j.1365-2052.2011.02208.x. Epub 2011 May 27.

A hidden Markov model combining linkage and linkage disequilibrium information for haplotype reconstruction and quantitative trait locus fine mapping.

Genetics. 2010 Mar;184(3):789-98. doi: 10.1534/genetics.109.108431. Epub 2009 Dec 14.

The use of family relationships and linkage disequilibrium to impute phase and missing genotypes in up to whole-genome sequence density genotypic data.

Genetics. 2010 Aug;185(4):1441-9. doi: 10.1534/genetics.110.113936. Epub 2010 May 17.

Extent of linkage disequilibrium, consistency of gametic phase, and imputation accuracy within and across Canadian dairy breeds.

J Dairy Sci. 2014 May;97(5):3128-41. doi: 10.3168/jds.2013-6826. Epub 2014 Feb 26.

Imputation of missing genotypes from sparse to high density using long-range phasing.

Genetics. 2011 Sep;189(1):317-27. doi: 10.1534/genetics.111.128082. Epub 2011 Jul 29.

Cited by

Family Study Designs Informed by Tumor Heterogeneity and Multi-Cancer Pleiotropies: The Power of the Utah Population Database.

Cancer Epidemiol Biomarkers Prev. 2020 Apr;29(4):807-815. doi: 10.1158/1055-9965.EPI-19-0912. Epub 2020 Feb 25.

Reparameterization of PAM50 Expression Identifies Novel Breast Tumor Dimensions and Leads to Discovery of a Genome-Wide Significant Breast Cancer Locus at .

Cancer Epidemiol Biomarkers Prev. 2018 Jun;27(6):644-652. doi: 10.1158/1055-9965.EPI-17-0887. Epub 2018 Apr 12.

A method combining a random forest-based technique with the modeling of linkage disequilibrium through latent variables, to run multilocus genome-wide association studies.

BMC Bioinformatics. 2018 Mar 27;19(1):106. doi: 10.1186/s12859-018-2054-0.

Novel pedigree analysis implicates DNA repair and chromatin remodeling in multiple myeloma risk.

PLoS Genet. 2018 Feb 1;14(2):e1007111. doi: 10.1371/journal.pgen.1007111. eCollection 2018 Feb.

Modelling and visualizing fine-scale linkage disequilibrium structure.

BMC Bioinformatics. 2013 Jun 6;14:179. doi: 10.1186/1471-2105-14-179.

Employing MCMC under the PPL framework to analyze sequence data in large pedigrees.

Front Genet. 2013 Apr 19;4:59. doi: 10.3389/fgene.2013.00059. eCollection 2013.

Pairwise shared genomic segment analysis in three Utah high-risk breast cancer pedigrees.

BMC Genomics. 2012 Nov 28;13:676. doi: 10.1186/1471-2164-13-676.

Case-control association testing by graphical modeling for the Genetic Analysis Workshop 17 mini-exome sequence data.

BMC Proc. 2011 Nov 29;5 Suppl 9(Suppl 9):S62. doi: 10.1186/1753-6561-5-S9-S62.

References

Enumerating the junction trees of a decomposable graph.

J Comput Graph Stat. 2009 Dec 1;18(4):930-940. doi: 10.1198/jcgs.2009.07129.

Enumerating the decomposable neighbours of a decomposable graph under a simple perturbation scheme.

Comput Stat Data Anal. 2009 Feb 15;53(4):1232-1238. doi: 10.1016/j.csda.2008.10.029.

Estimation of graphical models whose conditional independence graphs are interval graphs and its application to modeling linkage disequilibrium.

Comput Stat Data Anal. 2009 Mar 15;53(5):1818-1828. doi: 10.1016/j.csda.2008.02.003.

Simultaneous genotype calling and haplotype phasing improves genotype accuracy and reduces false-positive associations for genome-wide association studies.

Am J Hum Genet. 2009 Dec;85(6):847-61. doi: 10.1016/j.ajhg.2009.11.004.

A method and program for estimating graphical models for linkage disequilibrium that scale linearly with the number of loci, and their application to gene drop simulation.

Bioinformatics. 2009 May 15;25(10):1287-92. doi: 10.1093/bioinformatics/btp146. Epub 2009 Mar 16.

Shared genomic segment analysis. Mapping disease predisposition genes in extended pedigrees using SNP genotype assays.

Ann Hum Genet. 2008 Mar;72(Pt 2):279-87. doi: 10.1111/j.1469-1809.2007.00406.x. Epub 2007 Dec 18.

A second generation human haplotype map of over 3.1 million SNPs.

Nature. 2007 Oct 18;449(7164):851-61. doi: 10.1038/nature06258.

Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering.

Am J Hum Genet. 2007 Nov;81(5):1084-97. doi: 10.1086/521987. Epub 2007 Sep 21.

A new multipoint method for genome-wide association studies by imputation of genotypes.

Nat Genet. 2007 Jul;39(7):906-13. doi: 10.1038/ng2088. Epub 2007 Jun 17.

Towards linkage analysis with markers in linkage disequilibrium by graphical modelling.

Hum Hered. 2007;64(1):16-26. doi: 10.1159/000101419. Epub 2007 Apr 27.
