Xu Min, Zhu Mengxia, Zhang Louxin
Program in Molecular and Computational Biology, University of Southern California, Los Angeles, CA, USA.
BMC Genomics. 2008 Sep 16;9 Suppl 2(Suppl 2):S18. doi: 10.1186/1471-2164-9-S2-S18.
Microarray technology is often used to identify genes that are differentially expressed between two biological conditions. Because microarray datasets contain few samples and many genes, it is usually desirable to identify small gene subsets with distinct patterns between sample classes. Such gene subsets are highly discriminative in phenotype classification because their features are tightly coupled. Unfortunately, classifiers identified this way tend to generalize poorly to test samples because of overfitting.
We propose a novel approach that combines supervised and unsupervised learning techniques to generate increasingly discriminative gene clusters in an iterative manner. Our experiments on both simulated and real datasets show that our method can produce a series of robust gene clusters with good classification performance compared with existing approaches.
This backward approach for refining a series of highly discriminative gene clusters for classification proves to be consistent and stable when applied to various types of training samples.
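The abstract does not specify the clustering algorithm, the discriminative score, or the stopping rule, so the following is only a minimal sketch of what one backward, iterative combination of unsupervised and supervised steps could look like, not the authors' implementation. It assumes plain k-means for the unsupervised step, an absolute Welch t-statistic on each cluster's mean profile for the supervised step, and removal of the weakest cluster in each round. The names `t_score`, `kmeans`, `refine`, and `make_toy_data`, and all parameter values, are hypothetical.

```python
import random
import statistics


def t_score(values, labels):
    """Absolute Welch t-statistic of `values` between classes 0 and 1."""
    a = [v for v, y in zip(values, labels) if y == 0]
    b = [v for v, y in zip(values, labels) if y == 1]
    denom = (statistics.variance(a) / len(a) + statistics.variance(b) / len(b)) ** 0.5
    return abs(statistics.mean(a) - statistics.mean(b)) / denom if denom else 0.0


def kmeans(profiles, k, iters=20, seed=0):
    """Plain k-means over gene profiles; returns one cluster index per gene."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(profiles, k)]
    assign = [0] * len(profiles)
    for _ in range(iters):
        for g, p in enumerate(profiles):
            assign[g] = min(
                range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centers[c])),
            )
        for c in range(k):
            members = [p for p, a in zip(profiles, assign) if a == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = [statistics.mean(col) for col in zip(*members)]
    return assign


def refine(expr, labels, k=3, min_genes=4, max_rounds=10):
    """Assumed backward loop: cluster genes, score clusters, drop the weakest.

    `expr` maps gene name -> expression values (one per sample).
    Returns the best (score, genes) cluster found in each round.
    """
    genes = list(expr)
    history = []
    for _ in range(max_rounds):
        if len(genes) < max(k, min_genes):
            break
        # Unsupervised step: group genes with similar profiles.
        assign = kmeans([expr[g] for g in genes], k)
        clusters = {}
        for g, c in zip(genes, assign):
            clusters.setdefault(c, []).append(g)
        # Supervised step: score each cluster's mean profile against the labels.
        scored = []
        for members in clusters.values():
            mean_profile = [statistics.mean(col) for col in zip(*(expr[g] for g in members))]
            scored.append((t_score(mean_profile, labels), members))
        scored.sort(key=lambda s: s[0])
        history.append(scored[-1])
        if len(scored) < 2:
            break
        # Backward step: discard the least discriminative cluster.
        weakest = set(scored[0][1])
        remaining = [g for g in genes if g not in weakest]
        if len(remaining) < min_genes:
            break
        genes = remaining
    return history


def make_toy_data(seed=1):
    """Synthetic data: 6 class-shifted genes and 12 noise genes, 10 samples."""
    rng = random.Random(seed)
    labels = [0] * 5 + [1] * 5
    expr = {}
    for i in range(6):
        expr[f"info{i}"] = [rng.gauss(2.0 * y, 0.5) for y in labels]
    for i in range(12):
        expr[f"noise{i}"] = [rng.gauss(0.0, 0.5) for _ in labels]
    return expr, labels


if __name__ == "__main__":
    expr, labels = make_toy_data()
    for score, members in refine(expr, labels):
        print(round(score, 2), sorted(members))
```

Scoring the cluster mean rather than individual genes is one way to reflect the idea that a tightly coupled cluster, not a single gene, acts as the classifier feature. Any other clustering method or class-separation score could be substituted.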