Regularization Methods for High-Dimensional Instrumental Variables Regression With an Application to Genetical Genomics.

Authors

Lin Wei, Feng Rui, Li Hongzhe

Affiliation

Department of Biostatistics and Epidemiology, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104.

Publication

J Am Stat Assoc. 2015;110(509):270-288. doi: 10.1080/01621459.2014.908125.

DOI:10.1080/01621459.2014.908125

PMID:26392642

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC4573639/

Abstract

In genetical genomics studies, it is important to jointly analyze gene expression data and genetic variants in exploring their associations with complex traits, where the dimensionality of gene expressions and genetic variants can both be much larger than the sample size. Motivated by such modern applications, we consider the problem of variable selection and estimation in high-dimensional sparse instrumental variables models. To overcome the difficulty of high dimensionality and unknown optimal instruments, we propose a two-stage regularization framework for identifying and estimating important covariate effects while selecting and estimating optimal instruments. The methodology extends the classical two-stage least squares estimator to high dimensions by exploiting sparsity using sparsity-inducing penalty functions in both stages. The resulting procedure is efficiently implemented by coordinate descent optimization. For the representative L1 regularization and a class of concave regularization methods, we establish estimation, prediction, and model selection properties of the two-stage regularized estimators in the high-dimensional setting where the dimensionalities of covariates and instruments are both allowed to grow exponentially with the sample size. The practical performance of the proposed method is evaluated by simulation studies and its usefulness is illustrated by an analysis of mouse obesity data. Supplementary materials for this article are available online.
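
The two-stage idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it uses only an L1 penalty (no concave penalties), fixed tuning parameters instead of data-driven selection, and a plain cyclic coordinate descent loop. All names (`soft_threshold`, `lasso_cd`, `two_stage_lasso`, `simulate`) and the toy data-generating model are assumptions made for illustration. Stage 1 regresses each covariate on the instruments with a lasso penalty; stage 2 regresses the response on the stage-1 fitted covariates with a second lasso penalty.

```python
import random


def soft_threshold(z, t):
    """Soft-thresholding operator: sign(z) * max(|z| - t, 0)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_cd(A, b, lam, n_iter=100):
    """Minimize (1/(2n)) * ||b - A w||^2 + lam * ||w||_1 by cyclic
    coordinate descent. A is a list of n rows; b is a list of n responses.
    Assumes columns are roughly centered (no intercept is fitted)."""
    n, p = len(A), len(A[0])
    cols = [[A[i][j] for i in range(n)] for j in range(p)]
    col_sq = [sum(v * v for v in c) / n for c in cols]
    w = [0.0] * p
    resid = list(b)  # residual b - A w, kept up to date
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # an all-zero column carries no information
            c = cols[j]
            # Covariance of column j with the partial residual
            # (the current contribution of w[j] is added back).
            rho = sum(c[i] * resid[i] for i in range(n)) / n + col_sq[j] * w[j]
            new_w = soft_threshold(rho, lam) / col_sq[j]
            delta = new_w - w[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= delta * c[i]
                w[j] = new_w
    return w


def two_stage_lasso(Z, X, y, lam1, lam2):
    """Stage 1: lasso of each covariate column on the instruments Z.
    Stage 2: lasso of y on the stage-1 fitted covariates."""
    n, p, q = len(X), len(X[0]), len(Z[0])
    Gamma = [lasso_cd(Z, [X[i][j] for i in range(n)], lam1) for j in range(p)]
    X_hat = [[sum(Z[i][k] * Gamma[j][k] for k in range(q)) for j in range(p)]
             for i in range(n)]
    beta = lasso_cd(X_hat, y, lam2)
    return beta, Gamma


def simulate(n=200, seed=0):
    """Toy data (hypothetical): 5 instruments, 3 covariates, and a hidden
    confounder u that affects both the first covariate and the response.
    The true effect vector is [1, 0, 0]."""
    rng = random.Random(seed)
    Z, X, y = [], [], []
    for _ in range(n):
        z = [rng.gauss(0, 1) for _ in range(5)]
        u = rng.gauss(0, 1)
        x = [z[0] + z[1] + u + 0.5 * rng.gauss(0, 1),
             z[2] + 0.5 * rng.gauss(0, 1),
             z[3] + 0.5 * rng.gauss(0, 1)]
        Z.append(z)
        X.append(x)
        y.append(1.0 * x[0] + u + 0.5 * rng.gauss(0, 1))
    return Z, X, y


if __name__ == "__main__":
    Z, X, y = simulate()
    beta, Gamma = two_stage_lasso(Z, X, y, lam1=0.05, lam2=0.05)
    print("stage-2 coefficients:", beta)
    print("stage-1 coefficients for covariate 0:", Gamma[0])
```

Because the second stage uses fitted covariates built only from the instruments, the confounder's contribution to the first covariate is projected out before the response is regressed on it, which is the same logic as classical two-stage least squares. In practice both tuning parameters would be chosen from the data, for example by cross-validation.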

Similar articles

Regularization Methods for High-Dimensional Instrumental Variables Regression With an Application to Genetical Genomics.

J Am Stat Assoc. 2015;110(509):270-288. doi: 10.1080/01621459.2014.908125.

NETWORK-REGULARIZED HIGH-DIMENSIONAL COX REGRESSION FOR ANALYSIS OF GENOMIC DATA.

Stat Sin. 2014 Jul;24(3):1433-1459. doi: 10.5705/ss.2012.317.

Covariate-Adjusted Precision Matrix Estimation with an Application in Genetical Genomics.

Biometrika. 2013 Mar;100(1):139-156. doi: 10.1093/biomet/ass058. Epub 2012 Nov 30.

glmgraph: an R package for variable selection and predictive modeling of structured genomic data.

Bioinformatics. 2015 Dec 15;31(24):3991-3. doi: 10.1093/bioinformatics/btv497. Epub 2015 Aug 26.

Instrumental variables and inverse probability weighting for causal inference from longitudinal observational studies.

Stat Methods Med Res. 2004 Feb;13(1):17-48. doi: 10.1191/0962280204sm351ra.

Sparse logistic regression with a L1/2 penalty for gene selection in cancer classification.

BMC Bioinformatics. 2013 Jun 19;14:198. doi: 10.1186/1471-2105-14-198.

On a generalization of the test of endogeneity in a two stage least squares estimation.

J Appl Stat. 2020 Oct 26;49(3):709-721. doi: 10.1080/02664763.2020.1837084. eCollection 2022.

Estimation and Selection via Absolute Penalized Convex Minimization And Its Multistage Adaptive Applications.

J Mach Learn Res. 2012 Jun 1;13:1839-1864.

The L(1/2) regularization approach for survival analysis in the accelerated failure time model.

Comput Biol Med. 2015 Sep;64:283-90. doi: 10.1016/j.compbiomed.2014.09.002. Epub 2014 Sep 18.

Sparse multivariate factor analysis regression models and its applications to integrative genomics analysis.

Genet Epidemiol. 2017 Jan;41(1):70-80. doi: 10.1002/gepi.22018. Epub 2016 Nov 10.

Cited by

Estimating causal effects with hidden confounding using instrumental variables and environments.

Electron J Stat. 2023;17(2):2849-2879. doi: 10.1214/23-ejs2160. Epub 2023 Nov 10.

Causal Inference in Transcriptome-Wide Association Studies with Invalid Instruments and GWAS Summary Data.

J Am Stat Assoc. 2023;118(543):1525-1537. doi: 10.1080/01621459.2023.2183127. Epub 2023 Mar 17.

Mapping the Genetic-Imaging-Clinical Pathway with Applications to Alzheimer's Disease.

J Am Stat Assoc. 2022;117(540):1656-1668. doi: 10.1080/01621459.2022.2087658. Epub 2022 Jul 19.

DOUBLY DEBIASED LASSO: HIGH-DIMENSIONAL INFERENCE UNDER HIDDEN CONFOUNDING.

Ann Stat. 2022 Jun;50(3):1320-1347. doi: 10.1214/21-aos2152. Epub 2022 Jun 16.

Statistical methods for Mendelian randomization in genome-wide association studies: A review.

Comput Struct Biotechnol J. 2022 May 14;20:2338-2351. doi: 10.1016/j.csbj.2022.05.015. eCollection 2022.

Negative binomial factor regression with application to microbiome data analysis.

Stat Med. 2022 Jul 10;41(15):2786-2803. doi: 10.1002/sim.9384. Epub 2022 Apr 24.

Novel strategy for disease risk prediction incorporating predicted gene expression and DNA methylation data: a multi-phased study of prostate cancer.

Cancer Commun (Lond). 2021 Dec;41(12):1387-1397. doi: 10.1002/cac2.12205. Epub 2021 Sep 14.

An efficient and robust approach to Mendelian randomization with measured pleiotropic effects in a high-dimensional setting.

Biostatistics. 2022 Apr 13;23(2):609-625. doi: 10.1093/biostatistics/kxaa045.

Prediction of Radiosensitivity in Head and Neck Squamous Cell Carcinoma Based on Multiple Omics Data.

Front Genet. 2020 Aug 18;11:960. doi: 10.3389/fgene.2020.00960. eCollection 2020.

Implicating causal brain imaging endophenotypes in Alzheimer's disease using multivariable IWAS and GWAS summary data.

Neuroimage. 2020 Dec;223:117347. doi: 10.1016/j.neuroimage.2020.117347. Epub 2020 Sep 6.

References

Covariate-Adjusted Precision Matrix Estimation with an Application in Genetical Genomics.

Biometrika. 2013 Mar;100(1):139-156. doi: 10.1093/biomet/ass058. Epub 2012 Nov 30.

SparseNet: Coordinate Descent With Nonconvex Penalties.

J Am Stat Assoc. 2011;106(495):1125-1138. doi: 10.1198/jasa.2011.tm09738.

On the robustness of the adaptive lasso to model misspecification.

Biometrika. 2012 Sep;99(3):717-731. doi: 10.1093/biomet/ass027. Epub 2012 Jul 11.

Sparse Multivariate Regression With Covariance Estimation.

J Comput Graph Stat. 2010 Fall;19(4):947-962. doi: 10.1198/jcgs.2010.09188.

Tissue specificity of genetic regulation of gene expression.

Nat Genet. 2012 Oct;44(10):1077-8. doi: 10.1038/ng.2420.

Patterns of cis regulatory variation in diverse human populations.

PLoS Genet. 2012;8(4):e1002639. doi: 10.1371/journal.pgen.1002639. Epub 2012 Apr 19.

Correcting for Population Stratification in Genomewide Association Studies.

J Am Stat Assoc. 2011 Sep 1;106(495):997-1008. doi: 10.1198/jasa.2011.tm10294.

Non-Concave Penalized Likelihood with NP-Dimensionality.

IEEE Trans Inf Theory. 2011 Aug;57(8):5467-5484. doi: 10.1109/TIT.2011.2158486.

Joint modelling of confounding factors and prominent genetic regulators provides increased accuracy in genetical genomics studies.

PLoS Comput Biol. 2012 Jan;8(1):e1002330. doi: 10.1371/journal.pcbi.1002330. Epub 2012 Jan 5.

The Mouse Genome Database (MGD): comprehensive resource for genetics and genomics of the laboratory mouse.

Nucleic Acids Res. 2012 Jan;40(Database issue):D881-6. doi: 10.1093/nar/gkr974. Epub 2011 Nov 10.
