MRC Biostatistics Unit, Institute of Public Health, Biomedical Campus, University of Cambridge, Cambridge, UK.
Department of Epidemiology and Biostatistics, School of Public Health, Imperial College London, London, UK.
Genet Epidemiol. 2022 Oct;46(7):415-429. doi: 10.1002/gepi.22462. Epub 2022 May 31.
When genetic variants in a gene cluster are associated with a disease outcome, the causal pathway from the variants to the outcome can be difficult to disentangle. For example, the chemokine receptor gene cluster contains genetic variants associated with various cytokines, and associations between variants in this cluster and stroke risk may be driven by any of these cytokines. Multivariable Mendelian randomization extends standard univariable Mendelian randomization to estimate the direct effects of related exposures with shared genetic predictors. However, when genetic variants are clustered because they are located in a single genetic region, a Goldilocks dilemma arises: including too many highly correlated variants in the analysis can lead to ill-conditioning, but pruning variants too aggressively can lead to imprecise estimates or even lack of identification. We propose multivariable methods that use principal component analysis to reduce many correlated genetic variants to a smaller number of orthogonal components, which are then used as instrumental variables. We show in simulations that these methods give more precise estimates that are less sensitive to numerical instability arising from both strong correlations and small changes in the input data. We apply the methods to show that the most likely causal risk factor for stroke at the chemokine gene cluster is monocyte chemoattractant protein-1.
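The idea of replacing many correlated variants with a few orthogonal components can be illustrated with a minimal sketch on summarized data. It is not the authors' implementation: the function name `pca_mvmr`, the weighting matrix `psi`, the 99% variance threshold, and the toy inputs are all assumptions made for illustration. The sketch forms principal components of a weighted variant correlation matrix, projects the variant-exposure and variant-outcome associations onto the leading components, and fits a generalized least squares regression on the projected data.

```python
import numpy as np

def pca_mvmr(beta_x, beta_y, se_y, rho, var_explained=0.99):
    """Sketch of PCA-based multivariable MR on summarized data.

    beta_x : (G, K) variant-exposure associations for K exposures
    beta_y : (G,)   variant-outcome associations
    se_y   : (G,)   standard errors of beta_y
    rho    : (G, G) variant correlation (LD) matrix
    The weighting and threshold below are illustrative assumptions.
    """
    beta_x = np.asarray(beta_x, dtype=float)
    beta_y = np.asarray(beta_y, dtype=float)
    se_y = np.asarray(se_y, dtype=float)
    rho = np.asarray(rho, dtype=float)
    n_exposures = beta_x.shape[1]

    # Covariance of the variant-outcome associations implied by the LD matrix.
    omega = np.outer(se_y, se_y) * rho

    # Assumed weighting: variants with larger exposure associations and
    # more precise outcome associations contribute more to the components.
    w = np.abs(beta_x).sum(axis=1) / se_y
    psi = np.outer(w, w) * rho

    # Eigendecomposition of the symmetric weighted matrix, largest first.
    eigvals, eigvecs = np.linalg.eigh(psi)
    order = np.argsort(eigvals)[::-1]
    eigvals = np.clip(eigvals[order], 0.0, None)
    eigvecs = eigvecs[:, order]

    # Keep enough components to reach the variance threshold, and at
    # least as many as there are exposures so the model is identified.
    share = np.cumsum(eigvals) / eigvals.sum()
    n_comp = int(np.searchsorted(share, var_explained)) + 1
    n_comp = min(max(n_comp, n_exposures), len(eigvals))
    loadings = eigvecs[:, :n_comp]

    # Project the summarized associations onto the components.
    bx_t = loadings.T @ beta_x
    by_t = loadings.T @ beta_y
    omega_t = loadings.T @ omega @ loadings

    # Generalized least squares on the transformed data.
    omega_inv_bx = np.linalg.solve(omega_t, bx_t)
    info = bx_t.T @ omega_inv_bx
    theta = np.linalg.solve(info, omega_inv_bx.T @ by_t)
    se = np.sqrt(np.diag(np.linalg.inv(info)))
    return theta, se, n_comp

# Toy data (entirely hypothetical): 30 variants in strong LD, 3 exposures.
rng = np.random.default_rng(0)
n_variants = 30
idx = np.arange(n_variants)
rho = 0.9 ** np.abs(idx[:, None] - idx[None, :])      # AR(1)-style LD
beta_x = rho @ rng.normal(0.0, 0.05, size=(n_variants, 3))
theta_true = np.array([0.5, 0.0, -0.2])
beta_y = beta_x @ theta_true                          # noise-free outcome
se_y = np.full(n_variants, 0.05)

theta_hat, se_hat, n_comp = pca_mvmr(beta_x, beta_y, se_y, rho)
print(theta_hat, se_hat, n_comp)
```

Because the toy outcome associations are generated without noise, the estimates should reproduce `theta_true` up to rounding; with real data, the number of retained components trades conditioning against precision, which is the dilemma the abstract describes.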