MRC Integrative Epidemiology Unit.
UK Centre for Tobacco and Alcohol Studies.
Int J Epidemiol. 2018 Feb 1;47(1):226-235. doi: 10.1093/ije/dyx206.
Large-scale cross-sectional and cohort studies have transformed our understanding of the genetic and environmental determinants of health outcomes. However, the representativeness of these samples may be limited, either through selection into studies or through attrition from studies over time. Here we explore the potential impact of this selection bias on results obtained from these studies, from the perspective that it amounts to conditioning on a collider (i.e. a form of collider bias). Whereas it is acknowledged that selection bias will have a strong effect on representativeness and prevalence estimates, it is often assumed that it should not have a strong impact on estimates of associations. We argue that because selection can induce collider bias (which occurs when two variables independently influence a third variable, and that third variable is conditioned upon), selection can lead to substantially biased estimates of associations. In particular, selection related to phenotypes can bias associations with genetic variants associated with those phenotypes. In simulations, we show that even modest influences on selection into, or attrition from, a study can generate biased and potentially misleading estimates of both phenotypic and genotypic associations. Our results highlight the value of knowing which population a study sample is representative of. If the factors influencing selection and attrition are known, they can be adjusted for. For example, having DNA available on most participants in a birth cohort study offers the possibility of investigating the extent to which polygenic scores predict subsequent participation, which in turn would enable sensitivity analyses of the extent to which bias might distort estimates.
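The collider mechanism described above can be illustrated with a minimal simulation sketch. This is not the authors' simulation: the function name `simulate_selection_bias`, the sample size, the logistic participation model and the effect sizes are all assumptions chosen for illustration. Two variables are generated independently, each modestly raises the log-odds of participation, and their correlation is then compared in the full population and among participants only.

```python
import numpy as np


def simulate_selection_bias(n=200_000, effect_x=0.5, effect_y=0.5, seed=0):
    """Illustrative only: all parameter values are assumptions, not taken from the paper.

    x and y are independent in the full population. Both raise the
    log-odds of being selected into the study, so selection is a collider.
    """
    rng = np.random.default_rng(seed)

    # Two independent standard-normal variables (e.g. two phenotypes).
    x = rng.standard_normal(n)
    y = rng.standard_normal(n)

    # Assumed logistic participation model: each variable modestly
    # increases the probability of taking part.
    log_odds = effect_x * x + effect_y * y
    p_select = 1.0 / (1.0 + np.exp(-log_odds))
    selected = rng.random(n) < p_select

    # Correlation in the whole population versus among participants only.
    r_full = np.corrcoef(x, y)[0, 1]
    r_selected = np.corrcoef(x[selected], y[selected])[0, 1]
    return r_full, r_selected, selected.mean()


if __name__ == "__main__":
    r_full, r_selected, frac = simulate_selection_bias()
    print(f"fraction selected:            {frac:.3f}")
    print(f"correlation, full population: {r_full:+.3f}")
    print(f"correlation, selected sample: {r_selected:+.3f}")
```

Under these assumed settings the full-population correlation stays close to zero, while the correlation among participants becomes negative, because analysing only the selected sample conditions on a variable that both `x` and `y` influence. Increasing `effect_x` and `effect_y` should strengthen the induced association.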