Department of Computational Biology, Cornell University, Ithaca, NY, USA.
Department of Genetic Medicine, Weill Cornell Medicine, New York, NY, USA.
BMC Bioinformatics. 2020 May 7;21(1):178. doi: 10.1186/s12859-020-3387-z.
Heterogeneity in the definition and measurement of complex diseases in Genome-Wide Association Studies (GWAS) may lead to misdiagnoses and misclassification errors that can significantly impact discovery of disease loci. Although this problem is well appreciated, almost all analyses of GWAS data take reported disease phenotype values as is, without accounting for potential misclassification.
Here, we introduce Phenotype Latent variable Extraction of disease misdiagnosis (PheLEx), a GWAS analysis framework that learns and corrects misclassified phenotypes using structured genotype associations within a dataset. PheLEx consists of a hierarchical Bayesian latent variable model, where inference of differential misclassification is accomplished using filtered genotypes while implementing a full mixed model to account for population structure and genetic relatedness in study populations. Through simulations, we show that the PheLEx framework dramatically improves recovery of the correct disease state when considering realistic allele effect sizes compared to existing methodologies designed for Bayesian recovery of disease phenotypes. We also demonstrate the potential of PheLEx for extracting new potential loci from existing GWAS data by analyzing bipolar disorder and epilepsy phenotypes available from the UK Biobank. From the PheLEx analysis of these data, we identified new candidate disease loci not previously reported for these datasets that have value for supplemental hypothesis generation.
PheLEx shows promise in reanalyzing GWAS datasets to provide supplemental candidate loci that are ignored by traditional GWAS analysis methodologies.
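To make the idea of correcting misclassified phenotypes concrete, the following is a minimal sketch of the underlying Bayes-rule reasoning for a single variant. It is not the PheLEx model: it assumes one SNP, a logistic disease model with known coefficients, and known misclassification rates, and it omits the hierarchical structure, genotype filtering, and mixed model described above. All names and parameter values (`posterior_true_case`, `beta0`, `beta1`, `fp_rate`, `fn_rate`) are hypothetical and chosen only for illustration.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function mapping a real number to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def posterior_true_case(
    y_obs: int,
    genotype: int,
    beta0: float,
    beta1: float,
    fp_rate: float,
    fn_rate: float,
) -> float:
    """Posterior probability that the latent true status is 'case'.

    Assumptions (illustrative only, not taken from the paper):
      P(true case | genotype) = sigmoid(beta0 + beta1 * genotype)
      P(reported case | true control) = fp_rate
      P(reported control | true case) = fn_rate
    """
    prior_case = sigmoid(beta0 + beta1 * genotype)
    # Likelihood of the reported label under each latent state.
    like_if_case = (1.0 - fn_rate) if y_obs == 1 else fn_rate
    like_if_control = fp_rate if y_obs == 1 else (1.0 - fp_rate)
    numerator = like_if_case * prior_case
    denominator = numerator + like_if_control * (1.0 - prior_case)
    return numerator / denominator


if __name__ == "__main__":
    # For individuals reported as controls, show how the risk-allele count
    # (0, 1, or 2 copies) shifts the probability of being a missed case.
    for g in (0, 1, 2):
        p = posterior_true_case(
            y_obs=0, genotype=g, beta0=-2.0, beta1=0.8,
            fp_rate=0.05, fn_rate=0.2,
        )
        print(f"genotype={g}: P(true case | reported control) = {p:.3f}")
```

In this toy setting, a reported control who carries more copies of a risk allele receives a higher posterior probability of being a missed case, which is the intuition behind using genotype associations to flag likely misclassifications. In a full analysis the effect sizes and error rates would themselves be unknown and would have to be inferred jointly with the latent disease states.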