Applied Computational Intelligence Laboratory, Department of Electrical and Computer Engineering, Missouri University of Science and Technology, Rolla, MO 65409-0249, USA.
Artif Intell Med. 2010 Feb-Mar;48(2-3):91-8. doi: 10.1016/j.artmed.2009.06.001. Epub 2009 Dec 4.
In recent years, cancer researchers have widely recognized the importance of gene expression data in cancer diagnosis and treatment. However, a major challenge in the computational analysis of such data is the curse of dimensionality: the number of measured variables (genes) is overwhelming compared with the small number of samples. Here, we address this high dimensionality with a two-step method for reducing the dimension of gene expression data.
First, we extract a subset of genes based on statistical characteristics of their expression levels. Then, for further dimensionality reduction, we apply diffusion maps, which interpret the eigenfunctions of Markov matrices as a system of coordinates on the original data set, in order to obtain an efficient representation of the geometric structure of the data. Finally, a neural network clustering method, fuzzy ART, is applied to the reduced data to generate clusters of cancer samples.
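The three stages described above can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the abstract does not name the statistic used for gene selection, so per-gene variance is used as a stand-in; the Gaussian kernel, the median-distance bandwidth, and the fuzzy ART parameters (`rho`, `alpha`, `beta`) are likewise assumed defaults, and all function names are hypothetical.

```python
import numpy as np

def select_genes_by_variance(X, n_keep):
    """Keep the n_keep highest-variance genes (columns of X, samples x genes).
    Variance is an assumed criterion; the source does not specify the statistic."""
    keep = np.argsort(X.var(axis=0))[::-1][:n_keep]
    return X[:, np.sort(keep)]

def diffusion_map(X, n_components=2, epsilon=None, t=1):
    """Embed samples using eigenvectors of the Markov matrix P = D^-1 K."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    np.fill_diagonal(d2, 0.0)
    if epsilon is None:
        epsilon = np.median(d2[d2 > 0])  # heuristic bandwidth (assumption)
    K = np.exp(-d2 / epsilon)            # Gaussian kernel (assumption)
    d = K.sum(axis=1)
    # Symmetric matrix D^-1/2 K D^-1/2 has the same eigenvalues as P.
    S = K / np.sqrt(np.outer(d, d))
    eigvals, eigvecs = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    psi = eigvecs / np.sqrt(d)[:, None]  # right eigenvectors of P
    # Skip the trivial constant eigenvector (eigenvalue 1).
    idx = slice(1, n_components + 1)
    return psi[:, idx] * eigvals[idx] ** t

def fuzzy_art(A, rho=0.75, alpha=0.001, beta=1.0, n_epochs=1):
    """Minimal fuzzy ART on inputs A scaled to [0, 1] (samples x features)."""
    I_all = np.hstack([A, 1.0 - A])      # complement coding
    weights = []
    labels = np.full(len(A), -1)
    for _ in range(n_epochs):
        for i, I in enumerate(I_all):
            chosen = -1
            if weights:
                W = np.array(weights)
                match = np.minimum(I, W).sum(axis=1)   # |I ^ w_j|
                T = match / (alpha + W.sum(axis=1))    # category choice
                for j in np.argsort(-T):
                    if match[j] / I.sum() >= rho:      # vigilance test
                        chosen = int(j)
                        break
            if chosen < 0:
                weights.append(I.copy())               # commit a new category
                chosen = len(weights) - 1
            else:
                w = weights[chosen]
                weights[chosen] = beta * np.minimum(I, w) + (1.0 - beta) * w
            labels[i] = chosen
    return labels, np.array(weights)

# Synthetic stand-in for an expression matrix: 20 samples, 50 genes,
# two groups that differ only in the first 5 genes.
rng = np.random.default_rng(0)
X_demo = rng.normal(0.0, 0.1, size=(20, 50))
X_demo[:, :5] = rng.normal(0.0, 0.5, size=(20, 5))
X_demo[10:, :5] += 6.0

X_sel = select_genes_by_variance(X_demo, n_keep=5)
Y_demo = diffusion_map(X_sel, n_components=2)
# Rescale with one global min-max so relative distances are preserved.
A_demo = (Y_demo - Y_demo.min()) / (Y_demo.max() - Y_demo.min())
labels_demo, _ = fuzzy_art(A_demo, rho=0.75)
```

In this sketch the vigilance parameter `rho` controls cluster granularity: values closer to 1 yield more, tighter clusters. The global min-max rescaling before fuzzy ART is a design choice made here so that the eigenvalue weighting of the diffusion coordinates is not undone by per-column scaling.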
Experimental results on the small round blue-cell tumor data set show that, compared with other widely used clustering algorithms such as hierarchical clustering and K-means, our proposed method can effectively identify different cancer types and generate high-quality cancer sample clusters.
The proposed feature selection methods and diffusion maps can extract useful information from multidimensional gene expression data and prove effective at addressing the high dimensionality inherent in gene expression data analysis.