Université Paris-Saclay, Univ. Paris-Sud, UVSQ, CESP, INSERM U1018 Oncostat, Villejuif, F-94805, France.
Service de biostatistique et d'épidémiologie, Gustave Roussy, Villejuif, F-94805, France.
BMC Bioinformatics. 2020 Jul 2;21(1):277. doi: 10.1186/s12859-020-03618-y.
The standard lasso penalty and its extensions are commonly used to build a regularized regression model while selecting candidate predictors of a time-to-event outcome in high-dimensional data. However, these selection methods treat the variables as a homogeneous set and do not account for predictors that belong to functional groups; typically, genomic data can be grouped by biological pathway or by type of collected data. Another challenge is that the standard lasso penalization is known to have a high false discovery rate.
We evaluated different penalizations in a Cox model for selecting grouped variables, with two aims: to penalize more heavily variables that have a low effect and belong to a group with a low overall effect, and to favor the selection of variables that have a large effect and belong to a group with a large overall effect. We considered prespecified, disjoint groups and proposed several weights for the adaptive lasso method. In particular, we proposed the Max Single Wald by Single Wald product weighting (MSW*SW), which takes into account information on both the individual biomarker and the group to which it belongs. Through simulations, we compared the selection and prediction ability of our approach with those of the standard lasso, the composite Minimax Concave Penalty (cMCP), the group exponential lasso (gel), the Integrative L1-Penalized Regression with Penalty Factors (IPF-Lasso), and the Sparse Group Lasso (SGL). In addition, we illustrated the methods using gene expression data from 614 breast cancer patients.
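The idea behind a weighting such as MSW*SW can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not give the formula, so the sketch assumes that each variable's penalty weight is the inverse of the product of its own absolute Wald statistic (from a single-variable Cox model) and the largest absolute Wald statistic in its group. The function name `msw_sw_weights`, the `eps` stabilizer, and the toy values are all hypothetical.

```python
import numpy as np

def msw_sw_weights(wald, groups, eps=1e-8):
    """Hypothetical adaptive-lasso penalty weights.

    Assumed form (not confirmed by the abstract):
        weight_j = 1 / (max |Wald| in the group of j  *  |Wald_j|)
    A small weight means a weak penalty, so the variable is easier to select.
    """
    wald = np.abs(np.asarray(wald, dtype=float))
    groups = np.asarray(groups)
    group_max = np.empty_like(wald)
    # Disjoint, prespecified groups: each variable carries exactly one label.
    for g in np.unique(groups):
        mask = groups == g
        group_max[mask] = wald[mask].max()
    # eps guards against division by zero when a Wald statistic is exactly 0.
    return 1.0 / (group_max * wald + eps)

# Toy Wald statistics (made up for illustration) for four variables in two groups.
wald = [4.0, 0.5, 1.0, 0.5]
groups = [0, 0, 1, 1]
weights = msw_sw_weights(wald, groups)
```

In this toy setting, the second and fourth variables have the same individual Wald statistic, but the second one sits in the group with the stronger signal and therefore receives the smaller penalty weight. Weights of this kind would typically be passed to a penalized Cox solver as per-variable penalty factors.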
The adaptive lasso with the MSW*SW weighting incorporates information from both the grouping structure and the individual variable. It outperformed the competitors by reducing the false discovery rate without severely increasing the false negative rate.
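For readers who want to make the two selection criteria concrete, the following Python sketch computes them for a single simulated selection. It uses the usual definitions (false discovery rate as the share of selected variables that are not truly active, false negative rate as the share of truly active variables that were missed); the exact definitions used in the study are not given in this abstract, and the function name and example indices are hypothetical.

```python
def selection_error_rates(selected, truly_active):
    """Return (false discovery rate, false negative rate) for one selection."""
    selected, truly_active = set(selected), set(truly_active)
    tp = len(selected & truly_active)   # active variables that were selected
    fp = len(selected - truly_active)   # selected variables that are not active
    fn = len(truly_active - selected)   # active variables that were missed
    # By convention here, an empty selection has FDR 0 and an empty truth has FNR 0.
    fdr = fp / (fp + tp) if selected else 0.0
    fnr = fn / (fn + tp) if truly_active else 0.0
    return fdr, fnr

# Made-up example: four variables selected, three truly active.
fdr, fnr = selection_error_rates(selected=[0, 1, 2, 5], truly_active=[0, 1, 3])
```

In a simulation study these two rates would be averaged over many replicated data sets, which is what makes it possible to compare how each penalization trades false discoveries against missed variables.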