Dobbin Kevin K, Zhao Yingdong, Simon Richard M
Biometric Research Branch, Division of Cancer Treatment and Diagnosis, National Cancer Institute, NIH, Rockville, Maryland 20852, USA.
Clin Cancer Res. 2008 Jan 1;14(1):108-14. doi: 10.1158/1078-0432.CCR-07-0443.
A common goal of gene expression microarray studies is the development of a classifier that can be used to divide patients into groups with different prognoses, or with different expected responses to a therapy. Such classifiers are developed on a training set, that is, the set of samples used to fit the classifier. Determining how many samples the training set needs in order to produce a good classifier from high-dimensional microarray data is challenging.
We present a model-based approach to determining the sample size required to adequately train a classifier.
We show that the required sample size can be determined from three quantities: the standardized fold change, the class prevalence, and the number of genes or features on the arrays. Numerous examples and important experimental design issues are discussed. The method is adapted to address ex post facto determination of whether the size of a training set used to develop a classifier was adequate. An interactive web site for performing the sample size calculations is provided.
We showed that sample size calculations for classifier development from high-dimensional microarray data are feasible, discussed numerous important considerations, and presented examples.
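How the three quantities named above (standardized fold change, class prevalence, and number of genes) interact with training-set size can be illustrated with a small Monte Carlo sketch. This is not the model-based calculation from the paper or its web site. It is a toy simulation under stated assumptions: exactly one informative gene (index 0) whose class means differ by `delta` standard deviations, all other genes pure noise with unit variance, and a one-gene classifier that picks the gene with the largest training mean difference and cuts at the midpoint of the two class means. The function names, the `tolerance` criterion (mean accuracy within `tolerance` of that rule's large-sample accuracy), and all defaults are hypothetical.

```python
import math
import random


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def mean_accuracy(n_train, delta, prevalence, n_genes, n_reps=200, seed=0):
    """Average true accuracy of the toy one-gene classifier over n_reps training sets.

    Assumption: gene 0 is shifted by `delta` SDs in class 1; other genes are noise.
    """
    rng = random.Random(seed)
    # Split the training set according to class prevalence, at least 1 per class.
    n1 = max(1, min(n_train - 1, round(prevalence * n_train)))
    n0 = n_train - n1
    # The mean of n unit-variance normals has SD 1/sqrt(n), so draw it directly.
    sd0, sd1 = 1.0 / math.sqrt(n0), 1.0 / math.sqrt(n1)
    total = 0.0
    for _ in range(n_reps):
        best = None  # (abs mean difference, true shift, class-0 mean, class-1 mean)
        for g in range(n_genes):
            shift = delta if g == 0 else 0.0
            m0 = rng.gauss(0.0, sd0)
            m1 = rng.gauss(shift, sd1)
            if best is None or abs(m1 - m0) > best[0]:
                best = (abs(m1 - m0), shift, m0, m1)
        _, shift, m0, m1 = best
        cut = (m0 + m1) / 2.0
        sign = 1.0 if m1 >= m0 else -1.0
        # Rule: call class 1 when sign * (x - cut) > 0, with x ~ N(mu, 1).
        call1_given0 = normal_cdf(sign * (0.0 - cut))
        call1_given1 = normal_cdf(sign * (shift - cut))
        total += prevalence * call1_given1 + (1.0 - prevalence) * (1.0 - call1_given0)
    return total / n_reps


def smallest_adequate_n(delta, prevalence, n_genes, tolerance=0.05,
                        candidates=range(4, 101, 4)):
    """First candidate size whose mean accuracy is within `tolerance` of the
    large-sample accuracy of the midpoint rule, Phi(delta / 2); None if none is."""
    target = normal_cdf(delta / 2.0) - tolerance
    for n in candidates:
        if mean_accuracy(n, delta, prevalence, n_genes) >= target:
            return n
    return None
```

For example, `smallest_adequate_n(2.0, 0.5, 100)` scans training-set sizes for a two-standard-deviation effect among 100 genes with balanced classes. Rerunning with a smaller `delta`, a larger `n_genes`, or a `prevalence` far from 0.5 shows how each quantity changes the simulated requirement in this toy setting.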