Lee Sangkyun, Rahnenführer Jörg, Lang Michel, De Preter Katleen, Mestdagh Pieter, Koster Jan, Versteeg Rogier, Stallings Raymond L, Varesio Luigi, Asgharzadeh Shahab, Schulte Johannes H, Fielitz Kathrin, Schwermer Melanie, Morik Katharina, Schramm Alexander
Department of Computer Sciences, TU Dortmund University, Dortmund, Germany.
Department of Statistics, TU Dortmund University, Dortmund, Germany.
PLoS One. 2014 Oct 8;9(10):e108818. doi: 10.1371/journal.pone.0108818. eCollection 2014.
Identifying relevant signatures for clinical patient outcome is a fundamental task in high-throughput studies. Signatures, composed of features such as mRNAs, miRNAs, SNPs or other molecular variables, are often non-overlapping, even though they have been identified from similar experiments considering samples with the same type of disease. The lack of a consensus is mostly due to the fact that sample sizes are far smaller than the number of candidate features to be considered, and therefore signature selection suffers from large variation. We propose a robust signature selection method that enhances the selection stability of penalized regression algorithms for predicting survival risk. Our method is based on an aggregation of multiple, possibly unstable, signatures obtained with the preconditioned lasso algorithm applied to random (internal) subsamples of a given cohort's data, where the aggregated signature is shrunken by a simple thresholding strategy. The resulting method, RS-PL, is conceptually simple and easy to apply, relying on parameters automatically tuned by cross-validation. Robust signature selection using RS-PL operates within an (external) subsampling framework to estimate the selection probabilities of features in multiple trials of RS-PL. These probabilities are used for identifying reliable features to be included in a signature. Our method was evaluated on microarray data sets from neuroblastoma, lung adenocarcinoma, and breast cancer patients, extracting robust and relevant signatures for predicting survival risk. Signatures obtained by our method achieved high prediction performance and robustness, consistently over the three data sets. Genes with high selection probability in our robust signatures have been reported as cancer-relevant. The ordering of predictor coefficients associated with signatures was well-preserved across multiple trials of RS-PL, demonstrating the capability of our method for identifying a transferable consensus signature. The software is available as the R package rsig on CRAN (http://cran.r-project.org).
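The two-level subsampling scheme described above can be illustrated with a small, self-contained Python sketch. It is not the rsig implementation and makes several simplifying assumptions: the base selector is a plain top-k correlation filter standing in for the preconditioned lasso, the outcome is a continuous value rather than censored survival time, and the aggregated signature is thresholded on selection frequency. All function names, the subsample fraction, and the threshold values are hypothetical choices made for illustration.

```python
import random


def correlation(xs, ys):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def base_selector(X, y, k):
    """ASSUMPTION: stand-in for the preconditioned lasso.
    Returns the indices of the k features most correlated with y."""
    p = len(X[0])
    scores = [abs(correlation([row[j] for row in X], y)) for j in range(p)]
    return set(sorted(range(p), key=lambda j: scores[j], reverse=True)[:k])


def subsample(X, y, fraction, rng):
    """Draw a random subset of the samples without replacement."""
    idx = rng.sample(range(len(X)), max(2, int(fraction * len(X))))
    return [X[i] for i in idx], [y[i] for i in idx]


def robust_selection(X, y, k, n_inner, threshold, fraction, rng):
    """One trial: aggregate signatures from internal subsamples, then keep
    only features whose selection frequency reaches the threshold."""
    p = len(X[0])
    counts = [0] * p
    for _ in range(n_inner):
        Xs, ys = subsample(X, y, fraction, rng)
        for j in base_selector(Xs, ys, k):
            counts[j] += 1
    return {j for j in range(p) if counts[j] / n_inner >= threshold}


def selection_probabilities(X, y, k, n_outer, n_inner, threshold,
                            fraction=0.7, seed=0):
    """Repeat the trial on external subsamples and report, per feature,
    the fraction of trials in which it was selected."""
    rng = random.Random(seed)
    p = len(X[0])
    counts = [0] * p
    for _ in range(n_outer):
        Xs, ys = subsample(X, y, fraction, rng)
        for j in robust_selection(Xs, ys, k, n_inner, threshold, fraction, rng):
            counts[j] += 1
    return [c / n_outer for c in counts]


def make_toy_data(n=100, p=10, seed=1):
    """Synthetic data: only features 0 and 1 carry signal (hypothetical example)."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    y = [3 * row[0] - 3 * row[1] + rng.gauss(0, 0.5) for row in X]
    return X, y


X, y = make_toy_data()
probs = selection_probabilities(X, y, k=3, n_outer=20, n_inner=20, threshold=0.5)
# Features with a high estimated selection probability form the final signature.
signature = [j for j, pr in enumerate(probs) if pr >= 0.8]
print(signature)
```

In this toy setting the two informative features should receive selection probabilities close to 1, while noise features that enter individual signatures by chance are mostly filtered out by the inner threshold and the outer probability cut-off.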