School of Computer Science and Engineering, South China University of Technology, B3 Building, Higher Education Megacenter, Panyu, Guangzhou City, China 510006.
IEEE/ACM Trans Comput Biol Bioinform. 2012 Nov-Dec;9(6):1751-65. doi: 10.1109/TCBB.2012.108.
Successful diagnosis and treatment of cancer depend on discovering and classifying cancer types correctly. One challenge in class discovery from cancer data sets is that cancer gene expression profiles not only include a large number of genes but also contain many noisy genes. To reduce the effect of noisy genes in cancer gene expression profiles, we propose two new consensus clustering frameworks for cancer discovery from gene expression profiles: triple spectral clustering-based consensus clustering (SC3) and double spectral clustering-based consensus clustering (SC2Ncut). SC3 integrates the spectral clustering (SC) algorithm multiple times into the ensemble framework to process gene expression profiles. Specifically, spectral clustering is applied to cluster both the gene dimension and the cancer sample dimension, and is also used as the consensus function to partition the consensus matrix constructed from multiple clustering solutions. In contrast to SC3, SC2Ncut adopts the normalized cut algorithm, instead of spectral clustering, as the consensus function. Experiments on both synthetic data sets and real cancer gene expression profiles illustrate that the proposed approaches not only achieve good performance on gene expression profiles but also outperform most existing approaches in class discovery from these profiles.
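As a rough illustration of the consensus matrix mentioned in the abstract, the sketch below builds a co-association matrix from several clustering solutions. The abstract does not say how the matrix is constructed, so the co-association definition used here (the fraction of solutions in which two samples share a cluster) is an assumption, and the function name `consensus_matrix` and the toy labelings are hypothetical.

```python
from itertools import combinations


def consensus_matrix(partitions):
    """Return an n x n co-association matrix (assumed definition).

    Entry [i][j] is the fraction of clustering solutions in which
    samples i and j received the same cluster label.
    """
    if not partitions:
        raise ValueError("need at least one clustering solution")
    n = len(partitions[0])
    if any(len(labels) != n for labels in partitions):
        raise ValueError("all clustering solutions must label the same samples")

    counts = [[0] * n for _ in range(n)]
    for labels in partitions:
        # Every sample always co-occurs with itself.
        for i in range(n):
            counts[i][i] += 1
        # Count each unordered pair once and mirror it to keep symmetry.
        for i, j in combinations(range(n), 2):
            if labels[i] == labels[j]:
                counts[i][j] += 1
                counts[j][i] += 1

    m = len(partitions)
    return [[c / m for c in row] for row in counts]


# Toy input: three clustering solutions over four samples.
# Label values are arbitrary; only co-membership matters.
partitions = [
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
]
M = consensus_matrix(partitions)
```

In a pipeline like the one described, a consensus function (spectral clustering in SC3, normalized cut in SC2Ncut) would then partition this matrix, treating it as a sample-to-sample similarity; that step is not shown here.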