Mistakes in validating the accuracy of a prediction classifier in high-dimensional but small-sample microarray data.

Author information

Lee Sunho

Affiliation

Department of Applied Mathematics, Sejong University, Seoul, South Korea.

Publication information

Stat Methods Med Res. 2008 Dec;17(6):635-42. doi: 10.1177/0962280207084839. Epub 2008 Mar 28.

DOI:10.1177/0962280207084839

PMID:18375459

Abstract

A major goal of gene expression microarray studies is to develop an accurate classifier that can be adopted in clinical practice. Using large numbers of genes with small samples can lead to overfitting and produce promising but often nonreproducible results, so the reproducibility of a classifier must be assessed. Appropriate methods for validating a developed classifier and estimating its prediction accuracy are discussed. In addition, some mistakes that can arise in the cross-validation process are reviewed using articles published in prominent medical journals, so that inconclusive results of classifier development do not lead to inappropriate treatment.
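
The abstract does not enumerate the mistakes it reviews. One pitfall commonly discussed for high-dimensional, small-sample data, offered here as an assumed illustration and not as the author's own example, is selecting genes on the full data set before cross-validation begins. The minimal sketch below uses pure-noise data, so no classifier can truly beat chance. The helper names `select_features`, `predict`, `loocv_accuracy`, and `simulate` are hypothetical, and the nearest-centroid classifier and mean-difference ranking are stand-ins chosen for brevity.

```python
import random


def select_features(X, y, rows, k):
    """Rank features by absolute difference of class means, using only `rows`."""
    scores = []
    for j in range(len(X[0])):
        a = [X[i][j] for i in rows if y[i] == 0]
        b = [X[i][j] for i in rows if y[i] == 1]
        scores.append((abs(sum(a) / len(a) - sum(b) / len(b)), j))
    scores.sort(reverse=True)
    return [j for _, j in scores[:k]]


def predict(X, y, train_rows, test_row, feats):
    """Nearest-centroid prediction for one held-out sample."""
    dist = {}
    for label in (0, 1):
        members = [i for i in train_rows if y[i] == label]
        d = 0.0
        for j in feats:
            centroid = sum(X[i][j] for i in members) / len(members)
            d += (X[test_row][j] - centroid) ** 2
        dist[label] = d
    return 0 if dist[0] <= dist[1] else 1


def loocv_accuracy(X, y, k, select_inside):
    """Leave-one-out accuracy; `select_inside` redoes selection in every fold."""
    n = len(X)
    all_rows = list(range(n))
    # Selection on all samples: the held-out sample leaks into the feature choice.
    feats_all = select_features(X, y, all_rows, k)
    correct = 0
    for t in all_rows:
        train = [i for i in all_rows if i != t]
        feats = select_features(X, y, train, k) if select_inside else feats_all
        correct += predict(X, y, train, t, feats) == y[t]
    return correct / n


def simulate(n=20, p=200, k=5, reps=10, seed=0):
    """Average both estimates over several noise-only data sets (assumed sizes)."""
    rng = random.Random(seed)
    biased = honest = 0.0
    for _ in range(reps):
        X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
        y = [i % 2 for i in range(n)]  # balanced labels, unrelated to X
        biased += loocv_accuracy(X, y, k, select_inside=False)
        honest += loocv_accuracy(X, y, k, select_inside=True)
    return biased / reps, honest / reps


if __name__ == "__main__":
    b, h = simulate()
    print(f"selection outside CV: {b:.2f}  selection inside CV: {h:.2f}")
```

Because the labels are unrelated to the features, the estimate with selection repeated inside each fold should stay near chance, while the estimate with selection done once on all samples would be expected to look misleadingly good.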
