Datta Susmita, Le-Rademacher Jennifer, Datta Somnath
Department of Bioinformatics and Biostatistics, University of Louisville, Louisville, Kentucky 40202, USA.
Biometrics. 2007 Mar;63(1):259-71. doi: 10.1111/j.1541-0420.2006.00660.x.
We consider the problem of predicting survival times of cancer patients from the gene expression profiles of their tumor samples via linear regression modeling of log-transformed failure times. The partial least squares (PLS) and least absolute shrinkage and selection operator (LASSO) methodologies are used for this purpose, after the data are first modified to account for censoring. Three approaches to handling right-censored data are considered: reweighting, mean imputation, and multiple imputation. Their performances are examined in a detailed simulation study and compared with those of PLS and LASSO applied to the full data, as if there had been no censoring. A major objective of this article is to investigate the performances of PLS and LASSO in the context of microarray data, where the number of covariates is very large and there are extremely few samples. We demonstrate that LASSO outperforms PLS in terms of prediction error when the list of covariates includes a moderate to large percentage of useless or noise variables; otherwise, PLS may outperform LASSO. For a moderate sample size (100 samples with 10,000 covariates), LASSO performed better than a no-covariate model (or noise-based prediction). The mean imputation method appears to best track the performance of the full-data PLS or LASSO. The mean imputation scheme is then applied to an existing lung cancer data set. This reanalysis using the mean-imputed PLS and LASSO identifies a number of genes that were known from previous studies to be related to cancer or tumor activities.
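To make the mean imputation idea concrete, the following is a minimal sketch of one common form of it, not necessarily the exact scheme used in the article. Each censored log time is replaced by an estimate of the conditional mean of log T given survival past the censoring time, computed from the Kaplan-Meier jumps. The function names `km_jumps` and `mean_impute_log_times` are hypothetical. Normalizing by the mass of the later jumps, and leaving a censored value unchanged when no later event exists, are assumptions of this sketch.

```python
import numpy as np

def km_jumps(time, event):
    """Kaplan-Meier estimate: distinct event times and the probability
    mass (drop in the survival curve) placed on each of them."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    event_times = np.unique(time[event])  # sorted distinct event times
    surv = 1.0
    jumps = []
    for t in event_times:
        at_risk = np.sum(time >= t)
        deaths = np.sum((time == t) & event)
        new_surv = surv * (1.0 - deaths / at_risk)
        jumps.append(surv - new_surv)
        surv = new_surv
    return event_times, np.array(jumps)

def mean_impute_log_times(time, event):
    """Return log times with each censored value replaced by the
    Kaplan-Meier-weighted mean of the later log event times."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    y = np.log(time)
    event_times, jumps = km_jumps(time, event)
    for i in np.where(~event)[0]:
        later = event_times > time[i]
        mass = jumps[later].sum()
        if mass > 0:
            y[i] = np.sum(np.log(event_times[later]) * jumps[later]) / mass
        # Assumption: with no later event, keep log(censoring time).
    return y

# Toy data: the second subject is censored at time 2.
y_imputed = mean_impute_log_times([1, 2, 3, 4], [1, 0, 1, 1])
print(y_imputed)
```

The imputed response vector can then be passed, together with the expression matrix, to any PLS or LASSO fitting routine as if it were uncensored.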