Lin Wei, Feng Rui, Li Hongzhe
Department of Biostatistics and Epidemiology, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104.
J Am Stat Assoc. 2015;110(509):270-288. doi: 10.1080/01621459.2014.908125.
In genetical genomics studies, it is important to jointly analyze gene expression data and genetic variants in exploring their associations with complex traits, where the dimensionality of gene expressions and genetic variants can both be much larger than the sample size. Motivated by such modern applications, we consider the problem of variable selection and estimation in high-dimensional sparse instrumental variables models. To overcome the difficulty of high dimensionality and unknown optimal instruments, we propose a two-stage regularization framework for identifying and estimating important covariate effects while selecting and estimating optimal instruments. The methodology extends the classical two-stage least squares estimator to high dimensions by exploiting sparsity through sparsity-inducing penalty functions in both stages. The resulting procedure is efficiently implemented by coordinate descent optimization. For the representative L1 regularization and a class of concave regularization methods, we establish estimation, prediction, and model selection properties of the two-stage regularized estimators in the high-dimensional setting where the dimensionalities of covariates and instruments are both allowed to grow exponentially with the sample size. The practical performance of the proposed method is evaluated by simulation studies and its usefulness is illustrated by an analysis of mouse obesity data. Supplementary materials for this article are available online.
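The two-stage idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes an L1 penalty in both stages, fixed hand-picked tuning parameters (`lam1`, `lam2`), and a small simulated data set with an unobserved confounder. All function and variable names (`soft_threshold`, `lasso_cd`, `two_stage_lasso`, `simulate`) are hypothetical and chosen only for illustration. Stage 1 regresses each covariate on the instruments with a penalized fit; stage 2 regresses the response on the stage-1 fitted covariates, again with a penalty. Both stages use plain cyclic coordinate descent.

```python
import random

def soft_threshold(z, t):
    """Soft-thresholding operator: the closed-form L1 coordinate update."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lasso_cd(A, b, lam, n_iter=200):
    """Minimise (1/2n)||b - A w||^2 + lam*||w||_1 by cyclic coordinate descent.
    A is a list of n rows (each a list of p floats); b is a list of n floats."""
    n, p = len(A), len(A[0])
    w = [0.0] * p
    resid = list(b)  # residual b - A w, with w = 0 at the start
    col_sq = [sum(A[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # an all-zero column carries no information
            # covariance of column j with the partial residual (w_j removed)
            rho = sum(A[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * w[j]
            new_w = soft_threshold(rho, lam) / col_sq[j]
            delta = new_w - w[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= A[i][j] * delta
                w[j] = new_w
    return w

def two_stage_lasso(Z, X, y, lam1, lam2):
    """Stage 1: penalised regression of each covariate on the instruments Z.
    Stage 2: penalised regression of y on the stage-1 fitted covariates."""
    n, p, q = len(X), len(X[0]), len(Z[0])
    gamma = [lasso_cd(Z, [X[i][j] for i in range(n)], lam1) for j in range(p)]
    X_hat = [[sum(Z[i][k] * gamma[j][k] for k in range(q)) for j in range(p)]
             for i in range(n)]
    beta = lasso_cd(X_hat, y, lam2)
    return beta, gamma

def simulate(n=200, q=5, p=3, seed=0):
    """Toy data (an assumption for illustration): covariate j is driven by
    instrument j, and a hidden confounder u enters both X and y."""
    rng = random.Random(seed)
    Z, X, y = [], [], []
    for _ in range(n):
        z = [rng.gauss(0, 1) for _ in range(q)]
        u = rng.gauss(0, 1)  # unobserved confounder
        x = [z[j] + 0.5 * u + 0.5 * rng.gauss(0, 1) for j in range(p)]
        Z.append(z)
        X.append(x)
        y.append(2.0 * x[0] + u + 0.5 * rng.gauss(0, 1))  # only x[0] matters
    return Z, X, y

Z, X, y = simulate()
beta_hat, gamma_hat = two_stage_lasso(Z, X, y, lam1=0.1, lam2=0.1)
print("beta_hat:", [round(b, 3) for b in beta_hat])
```

In this toy setup only the first covariate has a true effect, so the first entry of `beta_hat` is expected to dominate the others. In practice the tuning parameters would be selected from the data rather than fixed, and a concave penalty could replace the L1 penalty in either stage.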