Xue Haoran, Shen Xiaotong, Pan Wei
School of Statistics, University of Minnesota, Minneapolis, Minnesota 55455.
Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota 55455.
J Am Stat Assoc. 2023;118(543):1525-1537. doi: 10.1080/01621459.2023.2183127. Epub 2023 Mar 17.
Transcriptome-wide association studies (TWAS) have recently emerged as a popular tool for discovering (putative) causal genes by integrating an outcome GWAS dataset with a gene expression/transcriptome GWAS (eQTL) dataset. In our motivating and target application, we aim to identify causal genes for low-density lipoprotein cholesterol (LDL), which is crucial for developing new treatments for hyperlipidemia and cardiovascular diseases. The statistical principle underlying TWAS is (two-sample) two-stage least squares (2SLS) using multiple correlated SNPs as instrumental variables (IVs); it is closely related to typical (two-sample) Mendelian randomization (MR) using independent SNPs as IVs, an approach that is expected to be impractical and lower-powered for TWAS (and some other) applications. However, some of the SNPs used are often not valid IVs, e.g., because of widespread pleiotropy, in which a SNP has a direct effect on the outcome that is not mediated through the gene of interest; this can lead to false conclusions by TWAS (or MR). Building on recent advances in sparse regression, we propose a robust and efficient inferential method that accounts for both hidden confounding and some invalid IVs via two-stage constrained maximum likelihood (2ScML), an extension of 2SLS. We first develop the proposed method with individual-level data, then extend it both theoretically and computationally to GWAS summary data for the most popular two-sample TWAS design, to which almost all existing robust IV regression methods are not applicable. We show that the proposed method achieves asymptotically valid statistical inference on causal effects, and we demonstrate its wider applicability and superior finite-sample performance over standard 2SLS/TWAS (and MR). We apply the methods to identify putative causal genes for LDL by integrating large-scale lipid GWAS summary data with eQTL data.
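To make the two-sample 2SLS principle and the invalid-IV problem concrete, the following is a minimal sketch on simulated individual-level data. It is not the 2ScML method proposed in the paper; the function names (`simulate`, `two_sample_2sls`), the effect sizes, the sample sizes, and the way correlated SNPs are generated are all assumptions chosen for illustration only.

```python
import numpy as np

def simulate(n, w, beta, alpha, rng):
    """Simulate correlated SNPs G, expression X, and outcome Y (hypothetical model).

    w     : SNP effects on expression (eQTL effects)
    beta  : causal effect of expression on the outcome
    alpha : direct (pleiotropic) SNP effects on the outcome; zero for valid IVs
    """
    p = len(w)
    # Two haplotypes per subject; a shared latent factor induces LD-like correlation.
    shared = rng.normal(size=(n, 2, 1))
    z = np.sqrt(0.3) * shared + np.sqrt(0.7) * rng.normal(size=(n, 2, p))
    G = (z > 0.5).sum(axis=1).astype(float)      # genotypes coded 0/1/2, shape (n, p)
    U = rng.normal(size=n)                       # hidden confounder of X and Y
    X = G @ w + U + rng.normal(size=n)
    Y = beta * X + G @ alpha + U + rng.normal(size=n)
    return G, X, Y

def two_sample_2sls(G1, X1, G2, Y2):
    """Stage 1 on the eQTL sample (G1, X1); stage 2 on the GWAS sample (G2, Y2)."""
    G1c, X1c = G1 - G1.mean(axis=0), X1 - X1.mean()
    w_hat, *_ = np.linalg.lstsq(G1c, X1c, rcond=None)   # stage 1: X ~ G
    x_hat = (G2 - G2.mean(axis=0)) @ w_hat              # imputed expression
    Y2c = Y2 - Y2.mean()
    beta_hat = (x_hat @ Y2c) / (x_hat @ x_hat)          # stage 2: Y ~ imputed X
    resid = Y2c - beta_hat * x_hat
    # Naive standard error: ignores the uncertainty in the stage-1 estimate w_hat.
    se = np.sqrt(resid @ resid / (len(Y2) - 2) / (x_hat @ x_hat))
    return beta_hat, se

rng = np.random.default_rng(0)
w = np.array([0.5, -0.4, 0.3, 0.6, -0.5])
beta_true = 0.3

# All SNPs are valid IVs: no direct effects on the outcome.
G1, X1, _ = simulate(2000, w, beta_true, np.zeros(5), rng)
G2, _, Y2 = simulate(20000, w, beta_true, np.zeros(5), rng)
beta_valid, se_valid = two_sample_2sls(G1, X1, G2, Y2)

# The first SNP is an invalid IV: it affects the outcome directly (pleiotropy).
pleio = np.array([0.5, 0.0, 0.0, 0.0, 0.0])
G2b, _, Y2b = simulate(20000, w, beta_true, pleio, rng)
beta_invalid, _ = two_sample_2sls(G1, X1, G2b, Y2b)

print(f"valid IVs:      beta_hat = {beta_valid:.3f} (naive SE {se_valid:.3f})")
print(f"one invalid IV: beta_hat = {beta_invalid:.3f}")
```

With all IVs valid, the estimate should land near the simulated effect despite the hidden confounder; with one pleiotropic SNP it is biased away from it, which is the failure mode the robust method is designed to address.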