Department of Statistics, University of Michigan, 451 West Hall, 1085 South University, Ann Arbor, MI, USA.
Department of Electrical Engineering, Stanford University, 350 Serra Mall, Stanford, CA, USA.
Biostatistics. 2021 Jan 28;22(1):181-197. doi: 10.1093/biostatistics/kxz024.
The goal of expression quantitative trait loci (eQTL) studies is to identify the genetic variants that influence the expression levels of genes in an organism. High-throughput technology has made such studies possible: in a given tissue sample, it enables us to quantify the expression levels of approximately 20 000 genes and to record the alleles present at millions of genetic polymorphisms. While obtaining these data is relatively cheap once a specimen is at hand, obtaining human tissue remains a costly endeavor: eQTL studies continue to be based on relatively small sample sizes, a limitation that is particularly serious for tissues such as brain and liver, which are often the organs of most immediate medical relevance. Given the high-dimensional nature of these datasets and the large number of hypotheses tested, the scientific community adopted multiplicity adjustment procedures early on. These testing procedures primarily control the false discovery rate for the identification of genetic variants that influence expression levels. In contrast, a problem that has received little attention to date is that of providing estimates of the effect sizes associated with these variants in a way that accounts for the considerable amount of selection. Yet, given the difficulty of procuring additional samples, this challenge is of practical importance. We illustrate in this work how the recently developed conditional inference approach can be deployed to obtain confidence intervals for the eQTL effect sizes with reliable coverage. The procedure we propose is based on a randomized hierarchical strategy with a 2-fold contribution: (1) it reflects the selection steps typically adopted in state-of-the-art investigations and (2) it introduces the use of randomness instead of data-splitting to maximize the use of available data. Analysis of the GTEx Liver dataset (v6) suggests that naively obtained confidence intervals would likely not cover the true values of effect sizes and that the number of local genetic polymorphisms influencing the expression level of genes might be underestimated.
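The coverage problem described above can be illustrated with a toy simulation. The sketch below is not the randomized hierarchical procedure proposed in the paper, and it does not use GTEx data. It only shows, under assumed values, why a naive 95% interval loses coverage once attention is restricted to estimates that passed a selection threshold. The function name `naive_coverage`, the true effect `beta = 1.0`, the unit standard error, and the selection cutoff `threshold = 2.5` are all hypothetical choices made for illustration.

```python
import random


def naive_coverage(beta=1.0, threshold=2.5, n_sims=200_000, seed=0):
    """Simulate naive 95% intervals for a normal estimate with unit standard error.

    Returns a pair: (coverage over all estimates,
                     coverage over estimates with |estimate| > threshold).
    All parameter values are illustrative assumptions, not values from the paper.
    """
    rng = random.Random(seed)
    z_crit = 1.96  # two-sided 95% normal quantile
    covered_all = 0
    selected = 0
    covered_selected = 0
    for _ in range(n_sims):
        estimate = rng.gauss(beta, 1.0)
        # The naive interval is estimate +/- z_crit; it covers beta when
        # the estimate lies within z_crit of the true effect.
        covers = abs(estimate - beta) <= z_crit
        covered_all += covers
        # Selection step: only "significant" estimates are reported.
        if abs(estimate) > threshold:
            selected += 1
            covered_selected += covers
    return covered_all / n_sims, covered_selected / selected


if __name__ == "__main__":
    overall, after_selection = naive_coverage()
    print(f"coverage, all estimates:      {overall:.3f}")
    print(f"coverage, selected estimates: {after_selection:.3f}")
```

Under these assumptions the interval attains close to its nominal 95% coverage across all estimates, but its coverage among the selected estimates falls well below that level, because selection favors estimates that overshoot the true effect. Conditional inference addresses this by basing the interval on the distribution of the estimate given that it was selected.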