IEEE/ACM Trans Comput Biol Bioinform. 2019 Mar-Apr;16(2):352-364. doi: 10.1109/TCBB.2017.2705686. Epub 2017 May 18.
Microarray technology enables the collection of vast amounts of gene expression data from biological experiments, and clustering algorithms have been successfully applied to explore these data. Because a set of genes may be correlated with only a subset of samples, co-clustering, which groups genes and samples simultaneously, is useful for recovering co-clusters in gene expression data. In this paper, we propose a novel algorithm, called Subspace Weighting Co-Clustering (SWCC), for high-dimensional gene expression data. SWCC introduces a gene subspace weight matrix that identifies the contribution of each gene in distinguishing different sample clusters. We design a new co-clustering objective function that incorporates this weight matrix, and we develop an iterative algorithm to solve it; the subspace weight matrix is computed automatically during the iterative co-clustering process. In an empirical comparison with six state-of-the-art clustering algorithms on ten gene expression data sets, the proposed algorithm shows encouraging results. We also propose using SWCC for gene clustering and selection. The experimental results show that the selected genes can improve the classification performance of Random Forests.
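The abstract does not state the SWCC objective function or its update rules, so they are not reproduced here. As a rough illustration of the general idea of a per-cluster gene weight matrix that is recomputed during the iterations, the sketch below uses a different and simpler technique: entropy-weighted subspace k-means applied to samples only, with no gene clustering step. The function name `weighted_subspace_kmeans`, the parameter `gamma`, the farthest-first initialization, and the toy data are all assumptions made for this example and do not come from the paper.

```python
import math


def sq_dist(a, b):
    """Plain squared Euclidean distance between two expression profiles."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def weighted_subspace_kmeans(X, k, gamma=1.0, n_iter=20):
    """Entropy-weighted subspace k-means (illustrative stand-in, NOT SWCC).

    X      : list of samples, each a list of gene expression values
    k      : number of sample clusters
    gamma  : hypothetical smoothing parameter; larger values flatten the weights
    Returns (labels, weights), where weights[l][j] is the weight of gene j
    in sample cluster l and each row of weights sums to 1.
    """
    n, m = len(X), len(X[0])

    # Deterministic farthest-first initialization (an assumption of this sketch).
    centers = [list(X[0])]
    while len(centers) < k:
        far = max(X, key=lambda x: min(sq_dist(x, c) for c in centers))
        centers.append(list(far))

    weights = [[1.0 / m] * m for _ in range(k)]  # start with uniform gene weights
    labels = [0] * n

    for _ in range(n_iter):
        # Step 1: assign each sample to the closest center, measuring distance
        # with that cluster's own gene weights.
        for i, x in enumerate(X):
            dists = [
                sum(w * (a - c) ** 2 for w, a, c in zip(weights[l], x, centers[l]))
                for l in range(k)
            ]
            labels[i] = dists.index(min(dists))

        for l in range(k):
            members = [X[i] for i in range(n) if labels[i] == l]
            if not members:
                continue  # empty cluster: keep its previous center and weights

            # Step 2: move the center to the mean of its members.
            centers[l] = [sum(col) / len(members) for col in zip(*members)]

            # Step 3: recompute gene weights from within-cluster dispersion.
            # Genes that vary little inside the cluster get larger weights.
            disp = [
                sum((x[j] - centers[l][j]) ** 2 for x in members) for j in range(m)
            ]
            lo = min(disp)  # shift the exponent for numerical stability
            expo = [math.exp(-(d - lo) / gamma) for d in disp]
            total = sum(expo)
            weights[l] = [e / total for e in expo]

    return labels, weights


# Toy data: 6 samples x 3 genes. Gene 0 separates the two sample groups,
# gene 1 varies strongly within each group, gene 2 is nearly constant.
X_toy = [
    [0.0, 1.0, 2.0],
    [0.2, -1.0, 2.1],
    [0.1, 0.5, 1.9],
    [10.0, -1.0, 2.0],
    [10.2, 1.0, 2.1],
    [9.9, 0.0, 1.9],
]
toy_labels, toy_weights = weighted_subspace_kmeans(X_toy, k=2)
```

In this toy setting, the gene that varies most within each sample group (index 1) should end up with the smallest weight in both rows of `toy_weights`. A weight matrix of this kind could then be used to rank genes before a downstream classifier such as Random Forests, although the selection procedure actually used with SWCC is not described in the abstract.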