Liao J G, Chin Khew-Voon
Drexel University School of Public Health, Philadelphia, PA 19102, USA.
Bioinformatics. 2007 Aug 1;23(15):1945-51. doi: 10.1093/bioinformatics/btm287. Epub 2007 May 31.
Logistic regression is a standard method for building prediction models for a binary outcome, and many authors have extended it to disease classification with microarray data. Because the number of genes is large and the number of subjects is small, however, a feature (gene) selection step must be added to penalized logistic modeling. Model selection for this two-step approach requires new statistical tools because prediction error estimates that ignore the feature selection step can be severely downward biased. Generic methods such as cross-validation and the non-parametric bootstrap can be very ineffective because of the large variability in the prediction error estimate.
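The downward bias from ignoring the selection step can be demonstrated with a small simulation. The sketch below is illustrative only and is not the paper's method or the GeneLogit library: all names (`top_genes`, `fit_ridge_logistic`, `cv_error`) are hypothetical helpers. On pure-noise data, where the true error rate of any classifier is 50%, selecting genes on the full data before cross-validation makes the estimated error look far better than selecting genes inside each fold.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated "null" microarray data: 40 subjects, 2000 genes, labels
# independent of expression, so the true error rate of any classifier is 50%.
n, p, k = 40, 2000, 10
X = rng.standard_normal((n, p))
y = rng.integers(0, 2, n)

def top_genes(X, y, k):
    """Select the k genes with the largest absolute two-sample t statistics."""
    m1, m0 = X[y == 1].mean(0), X[y == 0].mean(0)
    s = np.sqrt(X[y == 1].var(0, ddof=1) / (y == 1).sum()
                + X[y == 0].var(0, ddof=1) / (y == 0).sum())
    return np.argsort(-np.abs((m1 - m0) / s))[:k]

def fit_ridge_logistic(X, y, lam=1.0, iters=500, lr=0.1):
    """Penalized (ridge) logistic regression fitted by gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        p_hat = 1 / (1 + np.exp(-X @ w))
        w -= lr * (X.T @ (p_hat - y) + lam * w) / len(y)
    return w

def cv_error(X, y, k, lam, select_inside, folds=5):
    """5-fold CV error; gene selection either inside or outside the folds."""
    idx, wrong = np.arange(len(y)), 0
    genes_outside = top_genes(X, y, k)   # selection on ALL data: biased
    for f in range(folds):
        test = idx[f::folds]
        train = np.setdiff1d(idx, test)
        genes = top_genes(X[train], y[train], k) if select_inside else genes_outside
        w = fit_ridge_logistic(X[train][:, genes], y[train], lam)
        pred = (X[test][:, genes] @ w > 0).astype(int)
        wrong += (pred != y[test]).sum()
    return wrong / len(y)

biased = cv_error(X, y, k, 1.0, select_inside=False)  # optimistic estimate
honest = cv_error(X, y, k, 1.0, select_inside=True)   # near the true 50%
print(f"selection outside CV: {biased:.2f}, inside CV: {honest:.2f}")
```

The biased estimate is well below 50% even though the data carry no signal at all, which is exactly why prediction error estimation must account for the selection step.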
We propose a parametric bootstrap model, tailored to microarray data, for more accurate estimation of the prediction error; it borrows from the extensive research on identifying differentially expressed genes, especially the local false discovery rate. The proposed method provides guidance on the two critical issues in model selection: the number of genes to include in the model and the optimal shrinkage for the penalized logistic regression. We show that selecting more than 20 genes usually helps little in further reducing the prediction error. Application to Golub's leukemia data and our own cervical cancer data yields highly accurate prediction models.
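The two tuning choices named above, model size and shrinkage, can be pictured as a grid search. The sketch below substitutes plain nested cross-validation for the paper's parametric bootstrap and local false discovery rate machinery, so it is only a rough stand-in; the data-generating setup and the helper names (`tstat_rank`, `ridge_logit`, `nested_cv_error`) are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Data with real signal: 60 subjects, 1000 genes; only the first 5 genes
# differ between the two classes (mean shift of 1.5).
n, p = 60, 1000
y = rng.integers(0, 2, n)
X = rng.standard_normal((n, p))
X[:, :5] += 1.5 * y[:, None]

def tstat_rank(X, y):
    """Genes ordered by decreasing absolute two-sample t statistic."""
    m1, m0 = X[y == 1].mean(0), X[y == 0].mean(0)
    s = np.sqrt(X[y == 1].var(0, ddof=1) / (y == 1).sum()
                + X[y == 0].var(0, ddof=1) / (y == 0).sum())
    return np.argsort(-np.abs((m1 - m0) / s))

def ridge_logit(X, y, lam, iters=500, lr=0.1):
    """Ridge-penalized logistic regression by gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        w -= lr * (X.T @ (1 / (1 + np.exp(-X @ w)) - y) + lam * w) / len(y)
    return w

def nested_cv_error(k, lam, folds=5):
    """CV error with gene selection repeated inside every training fold."""
    idx, wrong = np.arange(n), 0
    for f in range(folds):
        test = idx[f::folds]
        train = np.setdiff1d(idx, test)
        genes = tstat_rank(X[train], y[train])[:k]
        w = ridge_logit(X[train][:, genes], y[train], lam)
        wrong += ((X[test][:, genes] @ w > 0).astype(int) != y[test]).sum()
    return wrong / n

# Grid over model size k and shrinkage lam; pick the joint minimiser.
grid = [(k, lam) for k in (5, 10, 20, 50) for lam in (0.1, 1.0, 10.0)]
errs = {kl: nested_cv_error(*kl) for kl in grid}
best = min(errs, key=errs.get)
print("best (k, lambda):", best, "error:", errs[best])
```

With only a handful of informative genes, error estimates tend to flatten out as k grows, which is consistent in spirit with the paper's observation that going beyond roughly 20 genes rarely helps.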
The R library GeneLogit is available at http://geocities.com/jg_liao