How large a training set is needed to develop a classifier for microarray data?

Authors

Dobbin Kevin K, Zhao Yingdong, Simon Richard M

Affiliation

Biometric Research Branch, Division of Cancer Treatment and Diagnosis, National Cancer Institute, NIH, Rockville, Maryland 20852, USA.

Publication information

Clin Cancer Res. 2008 Jan 1;14(1):108-14. doi: 10.1158/1078-0432.CCR-07-0443.

DOI: 10.1158/1078-0432.CCR-07-0443

PMID: 18172259

Abstract

PURPOSE

A common goal of gene expression microarray studies is the development of a classifier that can be used to divide patients into groups with different prognoses, or with different expected responses to a therapy. These types of classifiers are developed on a training set, which is the set of samples used to train a classifier. The question of how many samples are needed in the training set to produce a good classifier from high-dimensional microarray data is challenging.

EXPERIMENTAL DESIGN

We present a model-based approach to determining the sample size required to adequately train a classifier.

RESULTS

It is shown that sample size can be determined from three quantities: standardized fold change, class prevalence, and number of genes or features on the arrays. Numerous examples and important experimental design issues are discussed. The method is adapted to address ex post facto determination of whether the size of a training set used to develop a classifier was adequate. An interactive web site for performing the sample size calculations is provided.
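The abstract does not give the model-based formulas, so the sketch below does not reproduce the authors' method. It is a small Monte Carlo stand-in that shows how the three quantities named above (standardized fold change, class prevalence, and number of genes) interact with training set size. Everything in it is an assumption made for illustration: the function name `simulate_accuracy` and its parameters are hypothetical, genes are taken to be independent with unit variance, and the classifier is a deliberately simple one that selects the single gene with the largest difference in class means and assigns each test sample to the nearer class mean.

```python
import random


def simulate_accuracy(n_train, delta, prevalence, n_genes,
                      n_informative=1, n_test=500, n_reps=20, seed=0):
    """Estimate test accuracy of a toy single-gene classifier (hypothetical helper).

    Assumed model, not taken from the paper: genes are independent N(0, 1);
    the first `n_informative` genes have mean `delta` in class 1, where
    `delta` plays the role of the standardized fold change.
    Requires n_train >= 2 so that both classes are represented.
    """
    rng = random.Random(seed)
    # Fix the class sizes from the prevalence, keeping at least one sample per class.
    n1 = min(max(round(n_train * prevalence), 1), n_train - 1)
    labels = [1] * n1 + [0] * (n_train - n1)
    shift = [delta if g < n_informative else 0.0 for g in range(n_genes)]

    correct = 0
    for _ in range(n_reps):
        # Feature selection: keep the gene with the largest gap between class means.
        best_gene, best_gap, best_means = 0, -1.0, (0.0, 0.0)
        for g in range(n_genes):
            values = [rng.gauss(shift[g] * y, 1.0) for y in labels]
            m1 = sum(values[:n1]) / n1
            m0 = sum(values[n1:]) / (n_train - n1)
            if abs(m1 - m0) > best_gap:
                best_gene, best_gap, best_means = g, abs(m1 - m0), (m0, m1)

        # Evaluation: genes are independent, so only the selected gene is simulated.
        m0, m1 = best_means
        for _ in range(n_test):
            y = 1 if rng.random() < prevalence else 0
            x = rng.gauss(shift[best_gene] * y, 1.0)
            predicted = 1 if abs(x - m1) < abs(x - m0) else 0
            correct += predicted == y
    return correct / (n_reps * n_test)


if __name__ == "__main__":
    # Scan training set sizes to see where accuracy stops improving.
    for n in (4, 10, 20, 40, 60):
        acc = simulate_accuracy(n, delta=2.0, prevalence=0.5, n_genes=200, n_reps=10)
        print(n, round(acc, 3))
```

Under these assumptions, one rough way to read the output is to look for the smallest training set size at which the estimated accuracy stops improving noticeably. A smaller fold change, a more unbalanced prevalence, or more genes on the array would each be expected to push that size upward.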

CONCLUSION

We showed that sample size calculations for classifier development from high-dimensional microarray data are feasible, discussed numerous important considerations, and presented examples.

Abstract (Chinese version)