Gao Bin, Liu Xu, Li Hongzhe, Cui Yuehua
Department of Statistics and Probability, Michigan State University, East Lansing, Michigan.
Quantitative Sciences, Janssen Research & Development, LLC, Spring House, Pennsylvania.
Biometrics. 2019 Dec;75(4):1063-1075. doi: 10.1111/biom.13072. Epub 2019 Apr 29.
In a living organism, tens of thousands of genes are expressed and interact with each other to achieve necessary cellular functions. Gene regulatory networks contain information on regulatory mechanisms and the functions of gene expressions. Thus, incorporating network structures, discerned either through biological experiments or statistical estimations, could potentially increase the selection and estimation accuracy of genes associated with a phenotype of interest. Here, we considered a gene selection problem using gene expression data and the graphical structures found in gene networks. Because gene expression measurements are intermediate phenotypes between a trait and its associated genes, we adopted an instrumental variable regression approach. We treated genetic variants as instrumental variables to address the endogeneity issue. We proposed a two-step estimation procedure. In the first step, we applied the LASSO algorithm to estimate the effects of genetic variants on gene expression measurements. In the second step, the projected expression measurements obtained from the first step were treated as input variables. A graph-constrained regularization method was adopted to improve the efficiency of gene selection and estimation. We theoretically showed the selection consistency of the estimation method and derived the bound of the estimates. Simulation and real data analyses were conducted to demonstrate the effectiveness of our method and to compare it with its counterparts.
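The two-step procedure described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation. The abstract does not give the exact form of the graph-constrained penalty, so the sketch assumes a common network-constrained choice: an L1 term plus a quadratic term built from the unnormalized graph Laplacian, `lam1 * ||b||_1 + (lam2 / 2) * b' L b`. The function names (`soft_threshold`, `penalized_cd`, `graph_laplacian`, `two_step_fit`), the tuning values, and the simulated data are all hypothetical.

```python
import numpy as np


def soft_threshold(z, t):
    """Soft-thresholding operator used in the L1 coordinate update."""
    return np.sign(z) * max(abs(z) - t, 0.0)


def penalized_cd(X, y, lam1, lam2=0.0, L=None, n_iter=200):
    """Coordinate descent for
        (1/2n)||y - X b||^2 + lam1 * ||b||_1 + (lam2/2) * b' L b.
    With lam2 = 0 this reduces to the ordinary lasso.
    L is assumed symmetric positive semidefinite (e.g. a graph Laplacian)."""
    n, p = X.shape
    if L is None:
        L = np.zeros((p, p))
    b = np.zeros(p)
    r = np.array(y, dtype=float)          # running residual y - X b
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            denom = col_sq[j] + lam2 * L[j, j]
            if denom == 0.0:
                continue
            # Contribution of the other coefficients through the graph term
            neighbours = L[j] @ b - L[j, j] * b[j]
            rho = X[:, j] @ r / n + col_sq[j] * b[j] - lam2 * neighbours
            new = soft_threshold(rho, lam1) / denom
            r += X[:, j] * (b[j] - new)   # keep the residual in sync
            b[j] = new
    return b


def graph_laplacian(A):
    """Unnormalized Laplacian D - A of a symmetric adjacency matrix
    (an assumption; a normalized Laplacian is another common choice)."""
    return np.diag(A.sum(axis=1)) - A


def two_step_fit(Z, X, y, A, lam_stage1, lam1, lam2):
    """Z: genetic variants (instruments), X: gene expression, y: phenotype,
    A: adjacency matrix of the gene network."""
    Zc, Xc, yc = Z - Z.mean(axis=0), X - X.mean(axis=0), y - y.mean()
    # Step 1: one lasso per gene, expression regressed on genetic variants
    Gamma = np.column_stack(
        [penalized_cd(Zc, Xc[:, g], lam_stage1) for g in range(X.shape[1])]
    )
    X_hat = Zc @ Gamma                    # projected expression
    # Step 2: graph-constrained regression of the phenotype on X_hat
    beta = penalized_cd(X_hat, yc, lam1, lam2, graph_laplacian(A))
    return Gamma, beta


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, q, p = 200, 10, 4
    Z = rng.integers(0, 3, size=(n, q)).astype(float)   # genotype-like 0/1/2
    Gamma_true = np.zeros((q, p))
    Gamma_true[np.arange(p), np.arange(p)] = 1.0        # variant g drives gene g
    u = rng.normal(size=n)                              # unobserved confounder
    X = Z @ Gamma_true + u[:, None] + rng.normal(scale=0.5, size=(n, p))
    y = X[:, 0] + X[:, 1] + u + rng.normal(scale=0.5, size=n)
    A = np.zeros((p, p))
    A[0, 1] = A[1, 0] = 1.0                             # genes 0 and 1 are linked
    Gamma_hat, beta_hat = two_step_fit(Z, X, y, A, 0.05, 0.05, 0.1)
    print(beta_hat)
```

The simulated confounder `u` enters both the expression and the phenotype, which is the endogeneity that the instruments are meant to address: the second step regresses on the projected expression `X_hat` rather than on `X` itself. In practice the three tuning parameters would be chosen by cross-validation or a similar criterion rather than fixed by hand.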