Fu Li M, Youn Eun Seog
University of Florida, Gainesville, FL 32611, USA.
IEEE Trans Inf Technol Biomed. 2003 Sep;7(3):191-6. doi: 10.1109/titb.2003.816558.
Constructing a classifier from microarray gene expression data has recently emerged as an important problem in cancer classification. Recent results suggest that such a classifier can be built with reasonable predictive accuracy even when only a small number of cancer tissue samples of known type are available. The difficulty is that each sample contains expression data for a vast number of genes, and these genes may interact with one another. Selecting a small number of critical genes is fundamental to correctly analyzing the otherwise overwhelming data. A multivariate approach is essential for capturing the correlated structure in the data. However, the curse of dimensionality raises concern about the reliability of the selected genes. Here, we present a new gene selection method in which the error and repeatability of selected genes are assessed within M-fold cross-validation. In particular, we show that the method is able to identify the source variables underlying data generation.
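The idea of assessing both error and repeatability inside M-fold cross-validation can be illustrated with a minimal sketch. The abstract does not specify the ranking criterion, the classifier, or the data, so everything here is an assumption made for illustration: the synthetic data generator, the univariate mean-difference score (a simple stand-in, not the multivariate approach the abstract calls for), the nearest-centroid classifier, and all function names are hypothetical. Genes are re-selected on the training folds of each split, the held-out fold gives the error estimate, and the fraction of folds in which a gene is selected serves as a rough repeatability measure.

```python
import random
from collections import Counter
from statistics import mean, pstdev


def make_data(n_samples=40, n_genes=200, informative=(0, 1, 2), seed=0):
    """Synthetic expression matrix: only `informative` genes depend on the label."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n_samples):
        label = i % 2
        row = [rng.gauss(0.0, 1.0) for _ in range(n_genes)]
        for g in informative:
            row[g] += 3.0 * label  # shift the assumed source genes for class 1
        X.append(row)
        y.append(label)
    return X, y


def score_genes(X, y):
    """Univariate stand-in score: |mean difference| / (sum of class std devs)."""
    scores = []
    for g in range(len(X[0])):
        a = [row[g] for row, lab in zip(X, y) if lab == 0]
        b = [row[g] for row, lab in zip(X, y) if lab == 1]
        scores.append(abs(mean(a) - mean(b)) / (pstdev(a) + pstdev(b) + 1e-12))
    return scores


def select_top(X, y, k):
    """Indices of the k highest-scoring genes."""
    scores = score_genes(X, y)
    return sorted(range(len(scores)), key=lambda g: -scores[g])[:k]


def nearest_centroid_predict(X_train, y_train, x, genes):
    """Predict the class whose centroid (over `genes`) is closest to x."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(y_train)):
        rows = [r for r, lab in zip(X_train, y_train) if lab == label]
        dist = sum((x[g] - mean(r[g] for r in rows)) ** 2 for g in genes)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def cv_gene_selection(X, y, m=5, k=3, seed=0):
    """Return (cross-validated error rate, per-gene selection frequency)."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::m] for f in range(m)]
    counts, errors = Counter(), 0
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]
        X_tr = [X[i] for i in train_idx]
        y_tr = [y[i] for i in train_idx]
        genes = select_top(X_tr, y_tr, k)  # selection uses training folds only
        counts.update(genes)
        for i in test_idx:
            if nearest_centroid_predict(X_tr, y_tr, X[i], genes) != y[i]:
                errors += 1
    frequency = {g: c / m for g, c in counts.items()}
    return errors / len(X), frequency


if __name__ == "__main__":
    X, y = make_data()
    error, frequency = cv_gene_selection(X, y)
    print("CV error:", error)
    print("Selection frequency:", sorted(frequency.items(), key=lambda t: -t[1]))
```

Selecting genes inside each fold, rather than once on the full data set, keeps the held-out samples from influencing the selection, so the error estimate is not optimistically biased. A gene that is chosen in most or all folds is a more credible candidate than one that appears only once.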