Department of Business Analytics and Statistics, University of Tennessee, Knoxville, 37996, TN, USA.
Division of Biostatistics, University of Minnesota, Minneapolis, 55455, MN, USA.
BMC Bioinformatics. 2020 Jul 3;21(1):283. doi: 10.1186/s12859-020-03606-2.
The problem of assessing associations between multiple omics data, including genomics and metabolomics data, to identify biomarkers potentially predictive of complex diseases has garnered considerable research interest. A popular approach in epidemiology is to assess the association of each predictor with each response using a univariate linear regression model, and to select predictors that meet an a priori specified significance level. Although this approach is simple and intuitive, it tends to require a larger sample size, which is costly. It also assumes that the variables within each data type are independent, and thus ignores the correlations that exist between variables both within each data type and across data types.
We consider a multivariate linear regression model that relates multiple predictors to multiple responses, and we aim to identify the relevant predictors that are simultaneously associated with the responses. We assume the coefficient matrix of the responses on the predictors is both row-sparse and low-rank, and propose a group Dantzig-type formulation to estimate the coefficient matrix.
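To make the structural assumption concrete, the following is a minimal illustrative sketch, not the authors' estimator or simulation design. It generates data from a multivariate linear model Y = XB + E in which the coefficient matrix B is both row-sparse (only a few predictors have nonzero rows) and low-rank. The dimensions, noise level, and variable names are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not taken from the paper).
n, p, q = 100, 20, 8   # samples, predictors, responses
s, r = 5, 2            # number of relevant predictors, rank of B

# Row-sparse and low-rank coefficient matrix:
# only the first s rows are nonzero, and those rows are the product
# of two thin factors, so rank(B) is at most r.
B = np.zeros((p, q))
B[:s, :] = rng.normal(size=(s, r)) @ rng.normal(size=(r, q))

# Multivariate linear model Y = X B + E.
X = rng.normal(size=(n, p))
E = 0.1 * rng.normal(size=(n, q))
Y = X @ B + E

# Row-sparsity: a predictor is relevant if its row of B is nonzero,
# meaning it is associated with the responses jointly.
row_norms = np.linalg.norm(B, axis=1)
support = np.flatnonzero(row_norms > 0)

# Low rank: the responses depend on the predictors through r latent directions.
rank_B = np.linalg.matrix_rank(B)

print("relevant predictors:", support.tolist())
print("rank of B:", rank_B)
```

Under this structure, selecting variables amounts to recovering the nonzero rows of B, while the low rank reflects that the responses share a small number of latent directions.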
Extensive simulations demonstrate the competitive performance of our proposed method compared with existing methods in terms of estimation, prediction, and variable selection. We use the proposed method to integrate genomics and metabolomics data to identify genetic variants that are potentially predictive of atherosclerotic cardiovascular disease (ASCVD) beyond well-established risk factors. Our analysis identifies some genetic variants that improve prediction of ASCVD beyond some well-established risk factors for ASCVD, and also suggests a potential utility of the identified genetic variants in explaining possible associations between certain metabolites and ASCVD.