Bøvelstad H M, Nygård S, Størvold H L, Aldrin M, Borgan Ø, Frigessi A, Lingjaerde O C
Department of Mathematics, University of Oslo, Norway.
Bioinformatics. 2007 Aug 15;23(16):2080-7. doi: 10.1093/bioinformatics/btm305. Epub 2007 Jun 6.
Survival prediction from gene expression data and other high-dimensional genomic data has been the subject of much research in recent years. Such data pose a methodological problem: there are many more gene expression values than individuals. In addition, the responses are censored survival times. Most of the proposed methods handle this by using Cox's proportional hazards model and obtaining parameter estimates through some dimension reduction or parameter shrinkage technique. Using three well-known microarray gene expression data sets, we compare the prediction performance of seven such methods: univariate selection, forward stepwise selection, principal components regression (PCR), supervised principal components regression, partial least squares regression (PLS), ridge regression and the lasso.
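As an illustration of the shrinkage idea, the following is a minimal sketch of ridge regression in the Cox model: the negative partial log-likelihood plus a quadratic penalty, minimized by gradient descent with backtracking. It assumes no tied event times, and the names `ridge_cox_objective`, `fit_ridge_cox` and the penalty parameter `lam` are illustrative choices, not taken from the authors' Matlab and R code. The data are synthetic.

```python
import numpy as np

def ridge_cox_objective(beta, X, time, event, lam):
    """Ridge-penalized negative Cox partial log-likelihood and its gradient.

    Assumes no tied event times (an assumption of this sketch).
    """
    order = np.argsort(-time)              # decreasing time: risk set of row i is rows 0..i
    Xs, ds = X[order], event[order].astype(float)
    eta = Xs @ beta                        # linear predictors
    log_risk = np.logaddexp.accumulate(eta)  # log of sum of exp(eta) over each risk set
    n = len(eta)
    in_risk_set = np.tril(np.ones((n, n), dtype=bool))
    # Row i holds the normalized risk-set weights of subject i (zero outside the risk set).
    log_w = np.where(in_risk_set, eta[None, :] - log_risk[:, None], -np.inf)
    risk_mean = np.exp(log_w) @ Xs         # weighted covariate mean over each risk set
    value = -np.sum(ds * (eta - log_risk)) + 0.5 * lam * (beta @ beta)
    grad = -(ds[:, None] * (Xs - risk_mean)).sum(axis=0) + lam * beta
    return value, grad

def fit_ridge_cox(X, time, event, lam, n_iter=200):
    """Gradient descent with Armijo backtracking (a simple, not a fast, optimizer)."""
    beta = np.zeros(X.shape[1])
    value, grad = ridge_cox_objective(beta, X, time, event, lam)
    step = 1.0
    for _ in range(n_iter):
        while True:
            candidate = beta - step * grad
            new_value, new_grad = ridge_cox_objective(candidate, X, time, event, lam)
            if new_value <= value - 0.5 * step * (grad @ grad):
                break
            step *= 0.5                    # shrink the step until the objective decreases enough
        beta, value, grad = candidate, new_value, new_grad
        step *= 2.0                        # try a larger step in the next iteration
    return beta

# Synthetic example: 60 individuals, 200 "genes", only gene 0 affects the hazard.
rng = np.random.default_rng(0)
X = rng.standard_normal((60, 200))
time = rng.exponential(np.exp(-X[:, 0]))   # hazard exp(x0) means scale exp(-x0)
event = rng.random(60) < 0.7               # crude random censoring indicator
beta_hat = fit_ridge_cox(X, time, event, lam=10.0)
risk_score = X @ beta_hat                  # higher score = higher predicted risk
```

The penalty keeps all 200 coefficients in the model while shrinking them toward zero, which is what makes the fit well defined with more covariates than individuals. In practice `lam` would be chosen by cross-validation rather than fixed.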
To get a fair comparison between methods, statistical learning from subsets of the data should be repeated several times. Methods that use coefficient shrinkage or linear combinations of the gene expression values perform much better than the simple variable selection methods. For our data sets, ridge regression has the best overall performance.
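The repeated-subset comparison can be sketched as follows: each method is trained and scored on the same sequence of random training/test splits, and the distribution of scores is compared rather than a single split. This is a sketch under stated assumptions. The number of splits, the training fraction and the use of Harrell's concordance index as the score are choices made here, not details given in the abstract, and the two "methods" are hypothetical stand-ins, not the seven methods compared in the paper.

```python
import numpy as np

def concordance_index(risk, time, event):
    """Harrell's C: share of comparable pairs in which the higher risk has the shorter time."""
    concordant, comparable = 0.0, 0
    n = len(time)
    for i in range(n):
        if not event[i]:
            continue                       # pairs are anchored on observed events only
        for j in range(n):
            if time[j] > time[i]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    return concordant / comparable if comparable else float("nan")

def repeated_split_scores(methods, X, time, event, n_splits=50, train_frac=2 / 3, seed=0):
    """Score every method on the same random training/test splits."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    n_train = int(round(train_frac * n))
    scores = {name: [] for name in methods}
    for _ in range(n_splits):
        perm = rng.permutation(n)
        train, test = perm[:n_train], perm[n_train:]
        for name, fit in methods.items():
            predict = fit(X[train], time[train], event[train])
            scores[name].append(concordance_index(predict(X[test]), time[test], event[test]))
    return {name: np.array(vals) for name, vals in scores.items()}

def fit_top_correlated_gene(X_train, time_train, event_train):
    """Hypothetical stand-in: pick the gene most correlated with log time among events."""
    obs = event_train.astype(bool)
    t = np.log(time_train[obs])
    t = t - t.mean()
    Xc = X_train[obs] - X_train[obs].mean(axis=0)
    corr = Xc.T @ t / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(t))
    j = int(np.argmax(np.abs(corr)))
    sign = -np.sign(corr[j])               # negative correlation with time = higher risk
    return lambda X_new: sign * X_new[:, j]

def fit_mean_expression(X_train, time_train, event_train):
    """Hypothetical naive baseline: average expression, ignoring the training outcomes."""
    return lambda X_new: X_new.mean(axis=1)

# Synthetic example: 90 individuals, 30 "genes", only gene 0 affects the hazard.
rng = np.random.default_rng(1)
X = rng.standard_normal((90, 30))
time = rng.exponential(np.exp(-2.0 * X[:, 0]))
event = rng.random(90) < 0.7               # crude random censoring indicator
methods = {"top_gene": fit_top_correlated_gene, "mean_expression": fit_mean_expression}
scores = repeated_split_scores(methods, X, time, event, n_splits=20)
```

Because every method sees identical splits, differences in the score distributions (for example their medians) reflect the methods rather than the luck of one particular split.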
Matlab and R code for the prediction methods is available at http://www.med.uio.no/imb/stat/bmms/software/microsurv/.