Ray Debashree, Basu Saonli
Department of Biostatistics and Center for Statistical Genetics, University of Michigan, Ann Arbor, Michigan, United States of America.
Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota, United States of America.
Genet Epidemiol. 2017 Jul;41(5):413-426. doi: 10.1002/gepi.22045. Epub 2017 Apr 10.
In the past decade, many genome-wide association studies (GWASs) have used a case-control design to explore associations of single nucleotide polymorphisms (SNPs) with complex diseases. These GWASs collect information not only on disease status (primary phenotype, D) and SNPs (genotypes, X), but also on a wide range of risk factors and traits. Recent literature and grant proposals point toward a trend of reusing existing large case-control data sets to explore genetic associations of additional traits (secondary phenotypes, Y) collected during the study. These secondary phenotypes may be correlated, and a proper analysis warrants a multivariate approach. Commonly used multivariate methods are not equipped to account for the non-random sampling scheme. Current ad hoc practices include analyses without any adjustment and analyses that adjust for D as a covariate. Our theoretical and empirical studies suggest that the type I error for testing genetic association of secondary traits can be substantially inflated when both X and Y are associated with D, even when there is no association between X and Y in the underlying (target) population. Whether using D as a covariate helps maintain the type I error depends heavily on the disease mechanism and the underlying causal structure, which is often unknown. To avoid grossly incorrect inference, we propose a proportional odds model adjusted for propensity score (POM-PS). It uses a proportional odds logistic regression of X on Y and includes the estimated conditional probability of being diseased (the propensity score) as a covariate. We demonstrate the validity and advantages of POM-PS and compare it with some existing methods in extensive simulation experiments mimicking plausible scenarios of dependency among Y, X, and D. Finally, we use POM-PS to jointly analyze four adiposity traits in a type 2 diabetes (T2D) case-control sample from the population-based Metabolic Syndrome in Men (METSIM) study. Only the POM-PS analysis of the T2D case-control sample seems to provide valid association signals.
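The sampling problem described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' simulation design and does not implement POM-PS. It assumes a single SNP X (minor allele frequency 0.3), a single standard normal trait Y generated independently of X, and a logistic disease model with made-up coefficients. The function names `pearson` and `simulate` are hypothetical, and the Pearson correlation stands in as a crude proxy for an unadjusted association test.

```python
import math
import random


def pearson(a, b):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


def simulate(n_pop=400_000, n_per_group=4_000, seed=1):
    """Return (population correlation, case-control correlation) of X and Y.

    All parameter values are assumptions chosen for illustration only.
    """
    rng = random.Random(seed)
    maf = 0.3              # assumed minor allele frequency of the SNP
    b0, bx, by = -5.0, 0.7, 1.0  # assumed intercept and log odds ratios for D

    cases, controls = [], []
    pop_x, pop_y = [], []
    for _ in range(n_pop):
        # X: genotype coded 0/1/2; Y: trait drawn independently of X
        x = (rng.random() < maf) + (rng.random() < maf)
        y = rng.gauss(0.0, 1.0)
        # D depends on both X and Y through a logistic model
        p_disease = 1.0 / (1.0 + math.exp(-(b0 + bx * x + by * y)))
        diseased = rng.random() < p_disease
        (cases if diseased else controls).append((x, y))
        pop_x.append(x)
        pop_y.append(y)

    # Case-control sampling: equal numbers of cases and controls,
    # so cases are heavily over-represented relative to the population
    sample = rng.sample(cases, n_per_group) + rng.sample(controls, n_per_group)
    sample_x = [x for x, _ in sample]
    sample_y = [y for _, y in sample]
    return pearson(pop_x, pop_y), pearson(sample_x, sample_y)


if __name__ == "__main__":
    r_pop, r_cc = simulate()
    print(f"population correlation of X and Y:   {r_pop:.4f}")
    print(f"case-control correlation of X and Y: {r_cc:.4f}")
```

Under these assumed settings, the population correlation should be close to zero, while the pooled case-control sample should show a clearly positive correlation between X and Y. This happens because cases, who tend to carry more risk alleles and have higher trait values, are over-sampled. An unadjusted test applied to such a sample would tend to reject the null hypothesis too often.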