Kim Soyeon, Baladandayuthapani Veerabhadran, Lee J Jack
Department of Statistics, Rice University, Houston, TX, USA.
Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, TX, USA.
Stat Biosci. 2017 Jun;9(1):217-245. doi: 10.1007/s12561-016-9169-5. Epub 2016 Sep 26.
In personalized medicine, biomarkers are used to select the therapy with the highest likelihood of success based on an individual patient's biomarker/genomic profile. Two goals are to choose important biomarkers that accurately predict treatment outcomes and to cull unimportant biomarkers to reduce the cost of biological and clinical verification. These goals are challenging because of the high dimensionality of genomic data. Variable selection methods based on penalized regression (e.g., the lasso and elastic net) have yielded promising results. However, selecting the right amount of penalization is critical to achieving both goals simultaneously. Standard approaches based on cross-validation (CV) typically provide high prediction accuracy and high true positive rates, but at the cost of too many false positives. Alternatively, stability selection (SS) controls the number of false positives, but at the cost of yielding too few true positives. To circumvent these issues, we propose prediction-oriented marker selection (PROMISE), which combines SS with CV to retain the advantages of both methods. Our application of PROMISE with the lasso and elastic net in data analysis shows that, compared to CV, PROMISE produces sparse solutions, few false positives, and a small combined type I and type II error, and maintains good prediction accuracy, at the cost of a marginal decrease in the true positive rate. Compared to SS, PROMISE offers better prediction accuracy and true positive rates. In summary, PROMISE can be applied in many fields to select regularization parameters when the goals are to minimize false positives and maximize prediction accuracy.
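The abstract does not spell out how PROMISE combines SS with CV, so the following is only a minimal sketch of one plausible reading, not the authors' procedure: choose the lasso penalty by K-fold CV, then refit at that penalty on random half-samples and keep the features selected in a large fraction of them. The function names (`lasso_cd`, `cv_error`, `selection_freq`), the penalty grid, the threshold of 0.6, and the simulated data are all assumptions made for illustration.

```python
import random

def soft(z, t):
    """Soft-thresholding operator used by the lasso coordinate update."""
    return z - t if z > t else z + t if z < -t else 0.0

def lasso_cd(X, y, lam, n_iter=50):
    """Coordinate descent for (1/2n)*||y - Xb||^2 + lam*||b||_1, no intercept."""
    n, p = len(X), len(X[0])
    b = [0.0] * p
    r = list(y)  # running residual y - Xb
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            rho = sum(X[i][j] * r[i] for i in range(n)) / n + col_sq[j] * b[j]
            new = soft(rho, lam) / col_sq[j]
            delta = new - b[j]
            if delta != 0.0:
                for i in range(n):
                    r[i] -= delta * X[i][j]
                b[j] = new
    return b

def cv_error(X, y, lam, k=5):
    """Mean squared prediction error of the lasso under k-fold CV."""
    n, p = len(X), len(X[0])
    total = 0.0
    for fold in range(k):
        train = [i for i in range(n) if i % k != fold]
        test = [i for i in range(n) if i % k == fold]
        b = lasso_cd([X[i] for i in train], [y[i] for i in train], lam)
        for i in test:
            pred = sum(X[i][j] * b[j] for j in range(p))
            total += (y[i] - pred) ** 2
    return total / n

def selection_freq(X, y, lam, rng, n_sub=30):
    """Fraction of random half-samples in which each feature is selected."""
    n, p = len(X), len(X[0])
    counts = [0] * p
    for _ in range(n_sub):
        idx = rng.sample(range(n), n // 2)
        b = lasso_cd([X[i] for i in idx], [y[i] for i in idx], lam)
        for j in range(p):
            if b[j] != 0.0:
                counts[j] += 1
    return [c / n_sub for c in counts]

# Simulated data (an assumption): only features 0 and 1 carry signal.
rng = random.Random(0)
n, p = 40, 10
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [3.0 * row[0] - 2.0 * row[1] + rng.gauss(0, 0.5) for row in X]

lam_grid = [0.05, 0.1, 0.2, 0.4]  # hypothetical penalty grid
lam_cv = min(lam_grid, key=lambda lam: cv_error(X, y, lam))  # CV step
freq = selection_freq(X, y, lam_cv, rng)                      # SS-style step
selected = [j for j in range(p) if freq[j] >= 0.6]            # assumed threshold

print("CV-chosen penalty:", lam_cv)
print("selected features:", selected)
```

In this sketch the CV step alone would keep every feature with a nonzero coefficient at `lam_cv`, whereas the subsampling step filters out features that enter the model only occasionally, which is the trade-off between false positives and true positives described above.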