Department of Statistics, Operational Research and Numerical Analysis, Universidad Nacional de Educación a Distancia (UNED), Paseo Senda del Rey 9, 28040 Madrid, Spain.
Comput Biol Med. 2013 Oct;43(10):1437-43. doi: 10.1016/j.compbiomed.2013.07.005. Epub 2013 Jul 13.
An important issue in the analysis of gene expression microarray data is the extraction of valuable genetic interactions from high-dimensional data sets that contain gene expression levels collected for a small sample of assays. Past and ongoing research efforts have focused on biomarker selection for phenotype classification. Usually, many genes convey no useful information for classifying the outcome and should be removed from the analysis; on the other hand, some genes may be highly correlated, which reveals redundancy in the expressed information. In this paper we propose a method for selecting highly predictive genes with low redundancy in their expression levels. The predictive accuracy of the selection is assessed by means of Classification and Regression Trees (CART) models, which allow the performance of the selected genes in classifying the outcome variable to be assessed and also uncover complex genetic interactions. The method is illustrated throughout the paper using a public-domain colon cancer gene expression data set.
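The abstract does not state which relevance or redundancy measures the method uses, so the following is only a minimal sketch of the general idea (keep genes that predict the outcome well, skip genes that are highly correlated with ones already kept). It assumes absolute Pearson correlation as a stand-in for both measures; the function names, the threshold `max_redundancy`, and the toy data are hypothetical and are not taken from the paper.

```python
from statistics import mean, pstdev


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = mean(x), mean(y)
    sx, sy = pstdev(x), pstdev(y)
    if sx == 0 or sy == 0:
        return 0.0
    cov = mean((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (sx * sy)


def select_genes(expr, labels, k=2, max_redundancy=0.8):
    """Greedy filter: rank genes by relevance, keep those not redundant with earlier picks.

    expr   : dict mapping gene name -> list of expression levels (one per sample)
    labels : list of 0/1 class labels (one per sample)
    Relevance and redundancy are both measured by |Pearson r| (an assumption).
    """
    # Most relevant genes first: strongest absolute correlation with the class label.
    ranked = sorted(expr, key=lambda g: abs(pearson(expr[g], labels)), reverse=True)
    selected = []
    for gene in ranked:
        # Keep the gene only if it is not too correlated with any gene already kept.
        if all(abs(pearson(expr[gene], expr[s])) <= max_redundancy for s in selected):
            selected.append(gene)
        if len(selected) == k:
            break
    return selected


# Toy data (invented for illustration): 6 samples, 3 per class.
expr = {
    "gene_a": [1.0, 1.2, 0.9, 3.0, 3.1, 2.9],  # strongly predictive
    "gene_b": [1.0, 1.6, 0.8, 2.6, 3.3, 2.7],  # predictive but redundant with gene_a
    "gene_c": [0.5, 2.5, 0.4, 2.6, 0.6, 2.4],  # weakly predictive, not redundant
    "gene_d": [1.0, 1.0, 1.0, 1.0, 1.0, 1.0],  # constant, carries no information
}
labels = [0, 0, 0, 1, 1, 1]

selected = select_genes(expr, labels, k=2, max_redundancy=0.8)
print(selected)
```

In a fuller pipeline the selected columns would then be passed to a CART implementation (for example scikit-learn's `DecisionTreeClassifier`) and scored by cross-validation; that step is omitted here to keep the sketch dependency-free.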