IMPROVING POPULATION-SPECIFIC ALLELE FREQUENCY ESTIMATES BY ADAPTING SUPPLEMENTAL DATA: AN EMPIRICAL BAYES APPROACH.

Authors

Coram Marc, Tang Hua

Affiliation

Department of Health Research and Policy, Stanford University, Stanford, California 94305, USA.

Publication information

Ann Appl Stat. 2007 Dec 12;1(2):459-479. doi: 10.1214/07-aoas121.

DOI:10.1214/07-aoas121

PMID:21451739

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3065192/

Abstract

Estimation of the allele frequency at genetic markers is a key ingredient in biological and biomedical research, such as studies of human genetic variation or of the genetic etiology of heritable traits. As genetic data become increasingly available, investigators face a dilemma: when should data from other studies and population subgroups be pooled with the primary data? Pooling additional samples will generally reduce the variance of the frequency estimates; however, used inappropriately, pooled estimates can be severely biased due to population stratification. Because of this potential bias, most investigators avoid pooling, even for samples with the same ethnic background and residing on the same continent. Here, we propose an empirical Bayes approach for estimating allele frequencies of single nucleotide polymorphisms. This procedure adaptively incorporates genotypes from related samples, so that more similar samples have a greater influence on the estimates. In every example we have considered, our estimator achieves a mean squared error (MSE) that is smaller than that of either the pooled or the unpooled estimate, and it sometimes improves substantially on both extremes. The bias introduced is small, as shown by a simulation study that is carefully matched to a real data example. Our method is particularly useful when small groups of individuals are genotyped at a large number of markers, a situation we are likely to encounter in a genome-wide association study.
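To make the idea of adaptive pooling concrete, the following is a minimal sketch of a generic beta-binomial shrinkage estimator. It is not the estimator from the paper, whose details the abstract does not give. The sketch assumes a single supplemental sample, the same number of allele copies at every SNP, and a supplemental frequency treated as known when the concentration parameter is estimated. The function names, the `kappa` parameter, and the toy counts are all hypothetical.

```python
# Illustrative sketch only: a generic beta-binomial shrinkage, NOT the
# estimator from the paper. All names and data here are hypothetical.

def estimate_concentration(x_t, n_t, q):
    """Method-of-moments estimate of kappa in the assumed prior
    p_target ~ Beta(kappa * q, kappa * (1 - q)), fitted across SNPs.

    x_t : allele counts in the target sample, one per SNP
    n_t : allele copies per SNP in the target sample (2 x individuals)
    q   : supplemental-sample frequencies, each strictly inside (0, 1)

    Assumption: q is treated as known, so its own sampling error is ignored.
    """
    ratios = [((x / n_t) - qi) ** 2 / (qi * (1 - qi)) for x, qi in zip(x_t, q)]
    r = sum(ratios) / len(ratios)
    # Under the model, E[ratio] = 1/n_t + (1 - 1/n_t) * f with f = 1/(kappa + 1).
    f = (r - 1 / n_t) / (1 - 1 / n_t)
    f = min(max(f, 1e-6), 1.0)  # keep kappa finite and non-negative
    return 1 / f - 1


def shrunken_frequencies(x_t, n_t, x_s, n_s):
    """Return (posterior-mean frequencies for the target sample, kappa)."""
    # Smoothed supplemental frequencies, so that 0 < q < 1.
    q = [(x + 0.5) / (n_s + 1) for x in x_s]
    kappa = estimate_concentration(x_t, n_t, q)
    # The posterior mean is a weighted average of x/n_t and q;
    # the weight on q is kappa / (n_t + kappa).
    est = [(x + kappa * qi) / (n_t + kappa) for x, qi in zip(x_t, q)]
    return est, kappa


# Toy data: 10 target individuals (20 allele copies), 100 supplemental ones.
supplemental = [40, 100, 160]
similar_est, similar_kappa = shrunken_frequencies([4, 10, 16], 20, supplemental, 200)
divergent_est, divergent_kappa = shrunken_frequencies([16, 2, 4], 20, supplemental, 200)
print(similar_kappa, divergent_kappa)
```

When the target and supplemental frequencies agree across SNPs, the fitted `kappa` is large and the estimates lean on the supplemental data. When they disagree, `kappa` shrinks toward zero and the estimates fall back to the unpooled target frequencies. A single `kappa` shared across SNPs is a simplification chosen here for brevity.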
