Ekstrøm Claus T, Feenstra Bjarke
University of Southern Denmark, Biostatistics, Faculty of Health Sciences, Denmark.
Stat Appl Genet Mol Biol. 2012;11(3):Article 13. doi: 10.1515/1544-6115.1772.
Genetic association studies require that the genotype data from a given person can be correctly linked to the phenotype data from the same person. However, sample misidentification errors sometimes happen, whereby the link becomes invalid for some of the subjects in a study. This can have substantial consequences in terms of power to detect truly associated variants. In family-based studies, Mendelian inconsistencies can be used to detect sample misidentification. Genome-wide association studies (GWAS), however, typically use unrelated individuals, making error detection more problematic. Here we present a method for identifying potential sample misidentifications in GWAS and other genetic association studies building on ideas from forensic sciences. A widely used ad-hoc method for error detection is to check if the sex of an individual matches its X-linked genotype. We generalize this idea to less stringent associations between known genotypes and phenotypes, and show that if several known associations are combined, the power to detect misidentifications increases substantially. Individuals with an unlikely set of phenotypes given their genotypes are flagged as potential errors. We provide analytical and simulation results comparing the odds that the genotype and phenotype are both from the same individual for different numbers of available genotype-phenotype associations and for different information content of the associations. Our method has good sensitivity and specificity with as few as ten moderately informative genotype-phenotype associations. We apply the method to GWAS data from the Danish National Birth Cohort.
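The idea of combining several known genotype-phenotype associations into one piece of evidence can be illustrated with a minimal sketch. It is not the authors' statistic. It assumes binary phenotypes, genotypes coded as 0, 1, or 2 copies of an allele, associations that are independent of one another, and probabilities strictly between 0 and 1. The function names and all numerical values are hypothetical. For each association, the sketch compares the probability of the observed phenotype given the observed genotype with the probability of that phenotype in a randomly drawn individual, and sums the log ratios.

```python
import math

def marginal_prob(pheno_given_geno, geno_freq):
    """P(phenotype = 1) averaged over the population genotype frequencies."""
    return sum(p * f for p, f in zip(pheno_given_geno, geno_freq))

def log_match_ratio(genotypes, phenotypes, associations):
    """Sum of log likelihood ratios: same individual vs. unrelated sample.

    Each association is a pair (pheno_given_geno, geno_freq), where
    pheno_given_geno[g] is the assumed P(phenotype = 1 | genotype = g).
    Associations are treated as independent (an assumption of this sketch).
    """
    total = 0.0
    for g, y, (pheno_given_geno, geno_freq) in zip(genotypes, phenotypes, associations):
        p1_match = pheno_given_geno[g]
        p1_random = marginal_prob(pheno_given_geno, geno_freq)
        # Probability of the observed phenotype value under each hypothesis.
        p_match = p1_match if y == 1 else 1.0 - p1_match
        p_random = p1_random if y == 1 else 1.0 - p1_random
        total += math.log(p_match / p_random)
    return total

# Hypothetical example: ten identical, moderately informative associations.
assoc = ([0.2, 0.5, 0.8], [0.25, 0.5, 0.25])
associations = [assoc] * 10

# Phenotypes that agree with the genotypes give a positive log ratio.
consistent = log_match_ratio([2] * 10, [1] * 10, associations)
# Phenotypes that are unlikely given the genotypes give a negative log ratio.
inconsistent = log_match_ratio([2] * 10, [0] * 10, associations)
```

Under these assumptions, an individual with a strongly negative summed log ratio would be a candidate for flagging. Where to place the threshold is a separate choice that this sketch does not address.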