Fang Gang, Kuang Rui, Pandey Gaurav, Steinbach Michael, Myers Chad L, Kumar Vipin
Department of Computer Science, University of Minnesota, Twin Cities, 200 Union Street SE, Minneapolis, MN 55455, USA.
Pac Symp Biocomput. 2010:145-56.
In this paper, we study methods to identify differential coexpression patterns in case-control gene expression data. A differential coexpression pattern consists of a set of genes whose expression profiles have substantially different levels of coherence across the two sample classes, i.e., highly coherent in one class but not in the other. Biologically, a differential coexpression pattern may indicate the disruption of a regulatory mechanism, possibly caused by dysregulation of pathways or mutations of transcription factors. A common feature of all existing approaches to differential coexpression analysis is that the coexpression of a set of genes is measured on all the samples in each of the two classes, i.e., over the full space of samples. Hence, because of the heterogeneity of the subject population and of disease causes, these approaches may miss patterns that cover only a subset of samples in each class, i.e., subspace patterns. In this paper, we extend differential coexpression analysis by defining a subspace differential coexpression pattern, i.e., a set of genes that are coexpressed in a relatively large percentage of samples in one class, but in a much smaller percentage of samples in the other class. We propose a general approach based on an association analysis framework that allows exhaustive yet efficient discovery of subspace differential coexpression patterns. This approach can be used to adapt a family of biclustering algorithms to obtain their corresponding differential versions, which can directly discover differential coexpression patterns. Using a recently developed biclustering algorithm as an illustration, we perform experiments on cancer datasets that demonstrate the existence of subspace differential coexpression patterns. Permutation tests demonstrate the statistical significance of a large number of the discovered subspace patterns, many of which cannot be discovered if they are measured over all the samples in each of the classes.
Interestingly, in our experiments, some discovered subspace patterns have significant overlap with known cancer pathways, and some are enriched with the target gene sets of cancer-related microRNAs and transcription factors. The source code and datasets used in this paper are available at http://vk.cs.umn.edu/SDC/.
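The definition of a subspace differential coexpression pattern can be made concrete with a small sketch. The abstract does not specify how coexpression within a sample is measured, so the code below makes an illustrative assumption: a gene set counts as coexpressed in a sample when the spread (maximum minus minimum) of its expression values in that sample is at most a tolerance `tol`. The function names `support` and `support_difference`, the parameter `tol`, and the toy matrices are all hypothetical and are not taken from the paper or its released code.

```python
from typing import List, Sequence

Matrix = List[List[float]]  # rows = genes, columns = samples


def support(expr: Matrix, genes: Sequence[int], tol: float) -> float:
    """Fraction of samples in which the chosen genes are coexpressed.

    Assumption for illustration only: genes are "coexpressed" in a sample
    when the range of their values in that sample is at most `tol`.
    """
    n_samples = len(expr[0])
    hits = 0
    for s in range(n_samples):
        values = [expr[g][s] for g in genes]
        if max(values) - min(values) <= tol:
            hits += 1
    return hits / n_samples


def support_difference(case: Matrix, control: Matrix,
                       genes: Sequence[int], tol: float = 0.5) -> float:
    """Support in the case class minus support in the control class."""
    return support(case, genes, tol) - support(control, genes, tol)


# Toy data: 3 genes x 4 samples per class (made-up numbers).
case = [
    [1.0, 2.0, 3.0, 0.0],
    [1.1, 2.2, 2.9, 5.0],
    [0.9, 1.8, 3.1, 9.0],
]
control = [
    [1.0, 4.0, 0.0, 7.0],
    [1.2, 0.0, 3.0, 2.0],
    [0.8, 9.0, 6.0, 5.0],
]
genes = [0, 1, 2]

case_support = support(case, genes, 0.5)        # coherent in 3 of 4 samples
control_support = support(control, genes, 0.5)  # coherent in 1 of 4 samples
difference = support_difference(case, control, genes, 0.5)
```

In this toy example the gene set is coherent in three of four case samples but in only one of four control samples, which is the kind of contrast the subspace definition targets. A measure that requires coherence over every sample in a class would not register the case-class pattern, because the fourth case sample breaks it.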