Tian Xinyu, Wang Xuefeng, Chen Jun
Department of Applied Mathematics and Statistics, Stony Brook University, Stony Brook, NY, USA.
Department of Applied Mathematics and Statistics, Stony Brook University, Stony Brook, NY, USA. ; Department of Preventive Medicine, Stony Brook University, Stony Brook, NY, USA.
Cancer Inform. 2015 Jan 12;13(Suppl 6):25-33. doi: 10.4137/CIN.S17686. eCollection 2014.
The classic multinomial logit model, commonly used for multiclass regression problems, is restricted to a small number of predictors and does not take into account the relationships among variables. It therefore has limited use for genomic data, where the number of genomic features far exceeds the sample size. Genomic features such as gene expression levels are usually related by an underlying biological network. Efficient use of this network information is important for improving both classification performance and biological interpretability. We proposed a multinomial logit model that addresses both the high dimensionality of the predictors and the underlying network information. A group lasso penalty was used to induce model sparsity, and a network constraint was imposed to induce smoothness of the coefficients with respect to the underlying network structure. To deal with the non-smoothness of the objective function in optimization, we developed a proximal gradient algorithm for efficient computation. The proposed model was compared with models that use no prior structure information, both in simulations and in a cancer subtype prediction problem with real TCGA (The Cancer Genome Atlas) gene expression data. The network-constrained model outperformed the traditional ones in both cases.
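The abstract names three ingredients: a multinomial logit loss, a group lasso penalty, and a network smoothness constraint, fitted by a proximal gradient algorithm. The sketch below illustrates one way these pieces could fit together; it is not the authors' implementation, and the abstract does not give the exact objective. It assumes each feature's coefficients across all classes form one group (one row of the coefficient matrix `B`), that the network penalty is the quadratic form `(lam2 / 2) * trace(B.T @ L @ B)` with `L` the unnormalized graph Laplacian of an adjacency matrix `A`, that there is no intercept, and that a fixed step size derived from a Lipschitz bound is acceptable. The function names and the parameters `lam1`, `lam2`, and `n_iter` are hypothetical.

```python
import numpy as np

def softmax(Z):
    # Row-wise softmax with the usual max-shift for numerical stability.
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def group_soft_threshold(B, tau):
    # Proximal operator of tau * sum_j ||B[j, :]||_2:
    # shrink each row toward zero, and zero it out if its norm is <= tau.
    norms = np.linalg.norm(B, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - tau / np.maximum(norms, 1e-12))
    return scale * B

def fit_network_multinomial(X, y, A, lam1=0.05, lam2=0.1, n_iter=500):
    # X: (n, p) features; y: (n,) integer labels 0..K-1; A: (p, p) symmetric adjacency.
    n, p = X.shape
    K = int(y.max()) + 1
    Y = np.eye(K)[y]                      # one-hot labels, shape (n, K)
    L = np.diag(A.sum(axis=1)) - A        # unnormalized Laplacian (an assumption)

    # Upper bound on the Lipschitz constant of the smooth part's gradient:
    # the softmax Hessian has eigenvalues <= 1/2, and L contributes lam2 * lambda_max(L).
    lip = 0.5 * np.linalg.norm(X, 2) ** 2 / n + lam2 * np.linalg.eigvalsh(L)[-1]
    step = 1.0 / lip

    B = np.zeros((p, K))
    for _ in range(n_iter):
        P = softmax(X @ B)
        # Gradient of the smooth part: mean negative log-likelihood + network penalty.
        grad = X.T @ (P - Y) / n + lam2 * (L @ B)
        # Gradient step on the smooth part, then the group lasso proximal step.
        B = group_soft_threshold(B - step * grad, step * lam1)
    return B
```

Under these assumptions, rows of `B` that are shrunk exactly to zero correspond to features dropped from every class at once, while the Laplacian term pulls the coefficients of features connected in the network toward each other. A line search or an accelerated variant could replace the fixed step size.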