Department of Biostatistics, University of Washington, Seattle, Washington, USA.
Genet Epidemiol. 2010 Sep;34(6):591-602. doi: 10.1002/gepi.20516.
Genome-wide scans of nucleotide variation in human subjects are providing an increasing number of replicated associations with complex disease traits. Most of the variants detected have small effects and, collectively, they account for a small fraction of the total genetic variance. Very large sample sizes are required to identify and validate findings. In this situation, even small sources of systematic or random error can cause spurious results or obscure real effects. The need for careful attention to data quality has been appreciated for some time in this field, and a number of strategies for quality control and quality assurance (QC/QA) have been developed. Here we extend these methods and describe a system of QC/QA for genotypic data in genome-wide association studies (GWAS). This system includes some new approaches that (1) combine analysis of allelic probe intensities and called genotypes to distinguish gender misidentification from sex chromosome aberrations, (2) detect autosomal chromosome aberrations that may affect genotype calling accuracy, (3) infer DNA sample quality from relatedness and allelic intensities, (4) use duplicate concordance to infer SNP quality, (5) detect genotyping artifacts from dependence of Hardy-Weinberg equilibrium test P-values on allelic frequency, and (6) demonstrate sensitivity of principal components analysis to SNP selection. The methods are illustrated with examples from the "Gene Environment Association Studies" (GENEVA) program. The results suggest several recommendations for QC/QA in the design and execution of GWAS.
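Two of the checks listed above, duplicate concordance (4) and Hardy-Weinberg equilibrium testing (5), can be illustrated with a minimal Python sketch. This is not the GENEVA implementation. The function names, the 0/1/2 genotype coding with `None` for missing calls, and the use of a 1-degree-of-freedom chi-square approximation (rather than an exact test) are all assumptions made here for illustration.

```python
import math


def hwe_chisq_pvalue(n_aa, n_ab, n_bb):
    """Chi-square (1 df) HWE test P-value from genotype counts.

    Hypothetical helper: returns None when the SNP is monomorphic or
    has no called genotypes, since the test is undefined in those cases.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        return None
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        return None
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))


def duplicate_concordance(calls_1, calls_2):
    """Fraction of matching calls between two runs of the same sample.

    Genotypes are assumed coded 0/1/2, with None for a missing call;
    SNPs missing in either run are skipped. Returns None if no SNP
    is called in both runs.
    """
    pairs = [(a, b) for a, b in zip(calls_1, calls_2)
             if a is not None and b is not None]
    if not pairs:
        return None
    matches = sum(1 for a, b in pairs if a == b)
    return matches / len(pairs)
```

Computing `hwe_chisq_pvalue` per SNP and plotting the P-values against allele frequency is one way to look for the frequency dependence mentioned in (5). Aggregating discordant calls per SNP across many duplicate pairs, rather than per pair as above, would be the analogue of using concordance to assess SNP quality.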