School of Statistics, University of Minnesota, Minneapolis, Minnesota, USA.
University of Minnesota, Division of Biostatistics, School of Public Health, Minneapolis, Minnesota, USA.
Genet Epidemiol. 2022 Dec;46(8):572-588. doi: 10.1002/gepi.22491. Epub 2022 Jun 29.
Transcriptome-wide association studies (TWAS) have become increasingly popular for identifying genes (or other endophenotypes or exposures) associated with complex traits. In TWAS, one first builds a predictive model for gene expression using an expression quantitative trait loci (eQTL) data set in stage 1, then tests the association between the predicted gene expression and a trait in a large, independent genome-wide association study (GWAS) data set in stage 2. However, since the sample size of the eQTL data set is usually small and the coefficient of multiple determination (i.e., R²) of the model is also small for many genes, a question of interest is to what extent these factors affect the statistical power of TWAS. In addition, in contrast to a standard (univariate) TWAS (UV-TWAS), which considers only a single gene at a time, multivariate TWAS (MV-TWAS) methods have recently emerged to account for the effects of multiple genes, or a gene's nonlinear effects, simultaneously. In the absence of power analyses for these MV-TWAS methods, it is of interest to investigate whether one gains or loses power by using the newly proposed MV-TWAS instead of UV-TWAS. In this paper, we first outline a general method for sample size/power calculations for two-sample TWAS, then use real data (the Alzheimer's Disease Neuroimaging Initiative (ADNI) eQTL data and the Genotype-Tissue Expression (GTEx) eQTL data for stage 1; the International Genomics of Alzheimer's Project Alzheimer's disease (AD) GWAS summary data and UK Biobank (UKB) individual-level data for stage 2) to empirically address these questions. Our most important conclusions are the following. First, a sample size of a few thousand (~8000) would suffice in stage 1, at which point the power of TWAS is determined more by the cis-heritability of gene expression. Second, as in the general case of simple regression versus multiple regression, the power of MV-TWAS may be higher or lower than that of UV-TWAS, depending on the specific relationships among the GWAS trait and multiple genes (or linear and nonlinear terms of the same gene's expression levels), such as their correlations and effect sizes. Interestingly, several top genes with large power gains in MV-TWAS (over UV-TWAS) are known to be associated with AD (and were more significantly associated in our data). We also reached similar conclusions in an application to the GTEx whole blood gene expression data and UKB GWAS data of high-density lipoprotein cholesterol. The proposed method and the conclusions are expected to be useful in planning and designing future TWAS and other related studies (e.g., proteome- or metabolome-wide association studies) when determining the sample sizes for the two stages.
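The two conclusions above can be illustrated with a rough large-sample sketch. This is not the power calculation method proposed in the paper, which the abstract does not spell out; it is a textbook-style normal approximation under stated assumptions. All function and parameter names (`power_1df`, `uv_twas_ncp`, `two_gene_ncps`, `trait_r2`, `pred_r2`) are hypothetical, the default threshold 2.5e-6 assumes a Bonferroni correction over roughly 20,000 genes, and the two-gene comparison treats the predicted expression levels as standardized and known without error.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def power_1df(ncp, alpha=2.5e-6):
    """Power of a two-sided 1-df test with noncentrality parameter `ncp`,
    using the normal approximation Z ~ N(sqrt(ncp), 1)."""
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = sqrt(ncp)
    return _STD_NORMAL.cdf(shift - z_crit) + _STD_NORMAL.cdf(-shift - z_crit)


def uv_twas_ncp(n_gwas, trait_r2, pred_r2):
    """ASSUMPTION (illustrative only): the stage 2 noncentrality is roughly
    n_gwas * trait_r2 * pred_r2, where
      trait_r2 = share of trait variance explained by the genetically
                 regulated expression of the gene, and
      pred_r2  = squared correlation between the stage 1 prediction and that
                 genetically regulated expression (closer to 1 as the eQTL
                 sample grows).
    Valid only when trait_r2 * pred_r2 is small."""
    return n_gwas * trait_r2 * pred_r2


def two_gene_ncps(n_gwas, b1, b2, r, resid_var=1.0):
    """Noncentrality for testing gene 1 in the model
    Y = b1*X1 + b2*X2 + e, with corr(X1, X2) = r and Var(e) = resid_var.
    Returns (ncp_uv, ncp_mv): gene 1 tested alone versus gene 1 tested
    after adjusting for gene 2."""
    # Marginal slope absorbs part of gene 2's effect; the rest inflates the noise.
    ncp_uv = n_gwas * (b1 + r * b2) ** 2 / (resid_var + b2 ** 2 * (1 - r ** 2))
    # Adjusted test: correlation with gene 2 shrinks the usable variance of gene 1.
    ncp_mv = n_gwas * b1 ** 2 * (1 - r ** 2) / resid_var
    return ncp_uv, ncp_mv


if __name__ == "__main__":
    n = 50_000
    # Same-sign versus opposite-sign effects of two correlated genes.
    for b2 in (0.02, -0.02):
        ncp_uv, ncp_mv = two_gene_ncps(n, b1=0.02, b2=b2, r=0.8)
        print(f"b2={b2:+.2f}  UV power={power_1df(ncp_uv):.3f}  "
              f"MV power={power_1df(ncp_mv):.3f}")
```

Under these assumptions, two positively correlated genes with same-sign effects favor the single-gene test, whereas opposite-sign effects partly cancel in the marginal association and favor the adjusted test. This mirrors the simple versus multiple regression comparison drawn in the abstract.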