Design considerations for genetic linkage and association studies.

Authors

Nsengimana Jérémie, Bishop D Timothy

Affiliation

Section of Epidemiology and Biostatistics, Leeds Institute of Molecular Medicine, University of Leeds, Cancer Genetics Building, Leeds, UK.

Publication information

Methods Mol Biol. 2012;850:237-62. doi: 10.1007/978-1-61779-555-8_13.

Abstract

This chapter describes the main issues that genetic epidemiologists usually consider in the design of linkage and association studies. For linkage, we briefly consider the situation of rare, highly penetrant alleles showing a disease pattern consistent with Mendelian inheritance, investigated through parametric methods in large pedigrees or with autozygosity mapping in inbred families. We then turn our focus to the most common design, affected sibling pairs, which is of more relevance for common, complex diseases. Theoretical and more practical power and sample size calculations are provided as a function of the strength of the genetic effect being investigated. We also discuss the impact of other determinants of statistical power, such as disease heterogeneity and pedigree and genotyping errors, as well as the effect of the type and density of genetic markers. Linkage studies should be as large as possible to have sufficient power in relation to the expected genetic effect size. Segregation analysis, a formal statistical technique to describe the underlying genetic susceptibility, may for instance assist in estimating the relevant parameters to apply. However, segregation analyses estimate the total genetic component rather than a single-locus effect. Locus heterogeneity should be considered both when power is estimated and at the analysis stage, i.e. by assuming a smaller locus effect than the total genetic component from segregation studies. Disease heterogeneity should be minimised by considering subtypes if they are well defined, or otherwise by collecting known sources of heterogeneity and adjusting for them as covariates; the power will depend upon the relationship between the disease subtype and the underlying genotypes. Ultimately, identifying susceptibility alleles of modest effect (e.g. RR ≤ 1.5) requires a number of families that seems unfeasible in a single study. 
Meta-analysis and data pooling between different research groups can provide a sizeable study, but both approaches require an even higher level of vigilance about locus and disease heterogeneity when data come from different populations. All necessary steps should be taken to minimise pedigree and genotyping errors at the study design stage as they are, for the most part, due to human factors. A two-stage design is more cost-effective than a one-stage design when using short tandem repeats (STRs). However, dense single-nucleotide polymorphism (SNP) arrays offer a more robust alternative, and due to their lower cost per unit, the total cost of studies using SNPs may in the future become comparable to that of studies using STRs in one or two stages. For association studies, we consider the popular case-control design for dichotomous phenotypes, and we provide power and sample size calculations for one-stage and multistage designs. For candidate genes, guidelines are given on the prioritisation of genetic variants, and for genome-wide association studies (GWAS), the issue of choosing an appropriate SNP array is discussed. A warning is issued regarding the danger of designing an underpowered replication study following an initial GWAS. The risk of finding spurious association due to population stratification, cryptic relatedness, and differential bias is underlined. GWAS have a high power to detect common variants of high or moderate effect. For weaker effects (e.g. relative risk < 1.2), the power is greatly reduced, particularly for recessive loci. While sample sizes of 10,000 or 20,000 cases are not beyond reach for most common diseases, only meta-analyses and data pooling can allow attaining a study size of this magnitude for many other diseases. It is acknowledged that detecting the effects from rare alleles (i.e. frequency < 5%) is not feasible in GWAS, and it is expected that novel methods and technology, such as next-generation resequencing, will fill this gap. 
At the current stage, the choice of which GWAS SNP array to use does not influence the power in populations of European ancestry. A multistage design reduces the study cost but has less power than the standard one-stage design. If one opts for a multistage design, the power can be improved by jointly analysing the data from different stages for the SNPs they share. The estimates of locus contribution to disease risk from genome-wide scans are often biased, and relying on them might result in an underpowered replication study. Population structure has so far caused fewer spurious associations than initially feared, thanks to systematic ethnicity matching and application of standard quality control measures. Differential bias could be a more serious threat and must be minimised by strictly controlling all aspects of DNA acquisition, storage, and processing.
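As a rough illustration of why weak effects demand very large case-control samples, the sketch below approximates the power of a one-stage allelic test. It is not the chapter's own calculation. It assumes a multiplicative model with a per-allele relative risk, unselected controls whose allele frequency equals the population frequency, Hardy-Weinberg equilibrium, a directly genotyped causal SNP, and a two-sided normal-approximation test of two proportions. The function names and the default threshold of 5e-8 are illustrative choices, not taken from the source.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, used for critical values and tail areas


def case_allele_frequency(freq, rr):
    """Risk-allele frequency in cases under a multiplicative model.

    freq: population risk-allele frequency (0 < freq < 1)
    rr:   per-allele relative risk (assumed multiplicative across alleles)
    """
    return rr * freq / (rr * freq + (1.0 - freq))


def allelic_test_power(rr, freq, n_cases, n_controls, alpha=5e-8):
    """Approximate power of a two-sided allelic (two-proportion) z-test.

    Assumption: control allele frequency equals the population frequency.
    """
    p_case = case_allele_frequency(freq, rr)
    p_ctrl = freq
    a_case, a_ctrl = 2 * n_cases, 2 * n_controls  # two alleles per person
    p_bar = (a_case * p_case + a_ctrl * p_ctrl) / (a_case + a_ctrl)
    # Standard error under the null (pooled) and under the alternative.
    se_null = sqrt(p_bar * (1.0 - p_bar) * (1.0 / a_case + 1.0 / a_ctrl))
    se_alt = sqrt(p_case * (1.0 - p_case) / a_case
                  + p_ctrl * (1.0 - p_ctrl) / a_ctrl)
    z_crit = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    diff = p_case - p_ctrl
    # Probability of exceeding the critical value in either direction.
    upper = _STD_NORMAL.cdf((diff - z_crit * se_null) / se_alt)
    lower = _STD_NORMAL.cdf((-diff - z_crit * se_null) / se_alt)
    return upper + lower


def cases_needed(rr, freq, target_power=0.8, alpha=5e-8, max_cases=10_000_000):
    """Smallest number of cases (with equal controls) reaching target_power.

    Returns None if the target is not reached below max_cases.
    Assumes power grows with sample size, which holds when rr != 1.
    """
    lo, hi = 0, 1
    while allelic_test_power(rr, freq, hi, hi, alpha) < target_power:
        lo, hi = hi, hi * 2
        if hi > max_cases:
            return None
    while hi - lo > 1:  # bisection between a failing and a passing size
        mid = (lo + hi) // 2
        if allelic_test_power(rr, freq, mid, mid, alpha) >= target_power:
            hi = mid
        else:
            lo = mid
    return hi


if __name__ == "__main__":
    for rr in (1.5, 1.2, 1.1):
        print(rr, cases_needed(rr, freq=0.3))
```

Running the loop for a few relative risks at a fixed allele frequency shows how quickly the required number of cases grows as the effect weakens. Recessive models, imperfect tagging by the array, and multistage designs would each need their own treatment and are not covered by this sketch.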
