Dobbin Kevin K, Simon Richard M
Biometric Research Branch, National Cancer Institute, 6130 Executive Boulevard, Rockville, MD 20852, USA.
Biostatistics. 2007 Jan;8(1):101-17. doi: 10.1093/biostatistics/kxj036. Epub 2006 Apr 13.
Many gene expression studies attempt to develop a predictor of pre-defined diagnostic or prognostic classes. If the classes are similar biologically, then the number of genes that are differentially expressed between the classes is likely to be small compared to the total number of genes measured. This motivates a two-step process for predictor development: a subset of differentially expressed genes is selected for use in the predictor, and then the predictor is constructed from these genes. Both steps introduce variability into the resulting classifier, so both must be incorporated in sample size estimation. We introduce a methodology for sample size determination for prediction in the context of high-dimensional data that captures variability in both steps of predictor development. The methodology is based on a parametric probability model, but permits sample size computations to be carried out in a practical manner without extensive requirements for preliminary data. We find that many prediction problems do not require a large training set of arrays for classifier development.
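The two-step process described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' sample size methodology. It assumes independent, unit-variance Gaussian genes, a fixed mean shift in a handful of informative genes, gene selection by two-sample t-statistic, and a t-weighted linear score as the predictor. All function names and parameter values (`make_data`, `fit_predictor`, `effect=1.5`, and so on) are hypothetical choices made for illustration.

```python
import numpy as np


def make_data(n_per_class, rng, n_genes=1000, n_informative=10, effect=1.5):
    """Simulate expression data: class 1 is shifted by `effect` in the first genes."""
    X = rng.standard_normal((2 * n_per_class, n_genes))
    y = np.repeat([0, 1], n_per_class)
    X[y == 1, :n_informative] += effect  # assumed effect size, illustration only
    return X, y


def select_genes(X, y, n_keep):
    """Step 1: keep the n_keep genes with the largest absolute pooled t-statistic."""
    a, b = X[y == 0], X[y == 1]
    pooled_var = (
        (len(a) - 1) * a.var(axis=0, ddof=1) + (len(b) - 1) * b.var(axis=0, ddof=1)
    ) / (len(X) - 2)
    t = (b.mean(axis=0) - a.mean(axis=0)) / np.sqrt(
        pooled_var * (1 / len(a) + 1 / len(b))
    )
    keep = np.argsort(-np.abs(t))[:n_keep]
    return keep, t[keep]


def fit_predictor(X, y, n_keep):
    """Step 2: build a t-weighted linear score on the selected genes."""
    keep, weights = select_genes(X, y, n_keep)
    scores = X[:, keep] @ weights
    # Classify by the midpoint between the two class means of the training scores.
    threshold = (scores[y == 0].mean() + scores[y == 1].mean()) / 2
    return keep, weights, threshold


def predict(model, X):
    keep, weights, threshold = model
    return (X[:, keep] @ weights > threshold).astype(int)


def simulated_accuracy(n_per_class, rng, n_keep=10, n_test=200):
    """Train on n_per_class samples per class, then score on fresh test data."""
    X_train, y_train = make_data(n_per_class, rng)
    model = fit_predictor(X_train, y_train, n_keep)
    X_test, y_test = make_data(n_test, rng)
    return float((predict(model, X_test) == y_test).mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    for n in (3, 5, 10, 20, 40):
        print(n, simulated_accuracy(n, rng))
```

Because both gene selection and score construction are repeated on every simulated training set, the resulting accuracy reflects variability from both steps, which is the point the abstract makes about sample size estimation. Averaging over many simulated training sets at each size would give a smoother picture than the single draw printed here.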