Logistic regression for disease classification using microarray data: model selection in a large p and small n case.

Authors

Liao J G, Chin Khew-Voon

Affiliation

Drexel University School of Public Health, Philadelphia, PA 19102, USA.

Publication

Bioinformatics. 2007 Aug 1;23(15):1945-51. doi: 10.1093/bioinformatics/btm287. Epub 2007 May 31.

Abstract

MOTIVATION

Logistic regression is a standard method for building prediction models for a binary outcome, and many authors have extended it to disease classification with microarray data. Because the number of genes is large and the number of subjects is small, however, a feature (gene) selection step must be added to penalized logistic modeling. Model selection for this two-step approach requires new statistical tools, because a prediction error estimate that ignores the feature selection step can be severely downward biased. Generic methods such as cross-validation and the non-parametric bootstrap can be very ineffective because of the high variability of the prediction error estimate.
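The downward bias mentioned above can be illustrated with a small simulation. The following is a minimal sketch, not the authors' parametric bootstrap and not code from GeneLogit: the data are pure noise (no gene is truly predictive), and the helper names `fit_ridge_logistic`, `top_genes`, and `cv_error`, the ridge penalty, the mean-difference gene ranking, and all parameter values are assumptions chosen for illustration. It compares cross-validation with gene selection done once on all samples against cross-validation with gene selection repeated inside each training fold.

```python
import math
import random

def sigmoid(z):
    # Clip the linear predictor to avoid overflow in exp().
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))

def fit_ridge_logistic(X, y, lam=1.0, lr=0.5, steps=200):
    """Gradient descent on mean log-loss plus (lam / (2n)) * ||w||^2."""
    n, k = len(X), len(X[0])
    w, b = [0.0] * k, 0.0
    for _ in range(steps):
        gw, gb = [0.0] * k, 0.0
        for xi, yi in zip(X, y):
            r = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            gb += r
            for j in range(k):
                gw[j] += r * xi[j]
        w = [wj - lr * (gj + lam * wj) / n for wj, gj in zip(w, gw)]
        b -= lr * gb / n
    return w, b

def predict(w, b, x):
    return 1 if b + sum(wj * xj for wj, xj in zip(w, x)) > 0 else 0

def top_genes(X, y, k):
    """Rank genes by absolute difference in class means; keep the top k."""
    n1 = sum(y)
    n0 = len(y) - n1
    scores = []
    for j in range(len(X[0])):
        m1 = sum(x[j] for x, yi in zip(X, y) if yi == 1) / n1
        m0 = sum(x[j] for x, yi in zip(X, y) if yi == 0) / n0
        scores.append((abs(m1 - m0), j))
    return [j for _, j in sorted(scores, reverse=True)[:k]]

def cv_error(X, y, k, folds=5, select_inside=True):
    """Cross-validated misclassification rate of the two-step procedure."""
    n = len(X)
    # Selecting on ALL samples lets the test labels leak into the model.
    fixed = None if select_inside else top_genes(X, y, k)
    errors = 0
    for f in range(folds):
        train = [i for i in range(n) if i % folds != f]
        test = [i for i in range(n) if i % folds == f]
        X_tr = [X[i] for i in train]
        y_tr = [y[i] for i in train]
        genes = top_genes(X_tr, y_tr, k) if select_inside else fixed
        w, b = fit_ridge_logistic([[x[j] for j in genes] for x in X_tr], y_tr)
        errors += sum(predict(w, b, [X[i][j] for j in genes]) != y[i]
                      for i in test)
    return errors / n

random.seed(0)
n, p, k, reps = 20, 1000, 10, 5          # small n, large p, pure-noise genes
y = [i % 2 for i in range(n)]            # balanced class labels
naive, honest = [], []
for _ in range(reps):
    X = [[random.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    naive.append(cv_error(X, y, k, select_inside=False))
    honest.append(cv_error(X, y, k, select_inside=True))
naive_err = sum(naive) / reps
honest_err = sum(honest) / reps
print(f"selection outside CV: {naive_err:.2f}")
print(f"selection inside CV:  {honest_err:.2f}")
```

Because the simulated genes carry no signal, the true error rate of any classifier here is 50%. An estimate well below that when selection happens outside the cross-validation loop is the kind of optimism the abstract warns about; it does not by itself show the high variability that motivates the authors' parametric bootstrap.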

RESULTS

We propose a parametric bootstrap model for more accurate estimation of the prediction error that is tailored to the microarray data by borrowing from the extensive research in identifying differentially expressed genes, especially the local false discovery rate. The proposed method provides guidance on the two critical issues in model selection: the number of genes to include in the model and the optimal shrinkage for the penalized logistic regression. We show that selecting more than 20 genes usually helps little in further reducing the prediction error. Application to Golub's leukemia data and our own cervical cancer data leads to highly accurate prediction models.

AVAILABILITY

R library GeneLogit at http://geocities.com/jg_liao
