Machine Intelligence Unit, Indian Statistical Institute, 203 B.T. Road, Kolkata 700108, India.
IEEE/ACM Trans Comput Biol Bioinform. 2011 Jul-Aug;8(4):929-42. doi: 10.1109/TCBB.2010.106.
Two genes are said to be coexpressed if their expression levels follow a similar spatial or temporal pattern. Since microarray-based gene expression profiling began, computational modeling of coexpression has become a major focus. As a result, several similarity/distance measures have evolved over time to quantify the coexpression similarity/dissimilarity between gene pairs. Of these, the correlation coefficient has been established as a suitable quantifier of pairwise coexpression. In general, the correlation coefficient captures linear dependence well, but not nonlinear dependence. In spite of this drawback, it outperforms many other existing measures in modeling the dependency in biological data. In this paper, for the first time, we point out a significant weakness of the existing similarity/distance measures, including the standard correlation coefficient, in modeling pairwise coexpression of genes. We introduce a novel measure, called BioSim, which takes values between -1 and +1, with negative and positive values corresponding to negative and positive dependency and 0 to independence. The computation of BioSim is based on the aggregation of the stepwise relative angular deviation of the expression vectors considered. The proposed measure is analytically suitable for modeling coexpression because it accounts for expression similarity, expression deviation, and relative dependence. We demonstrate that the proposed measure captures the degree of coexpression between a pair of genes better than several existing measures. The efficacy of the measure is statistically analyzed by integrating it with several module-finding algorithms based on coexpression values and then applying it to synthetic and biological data. The annotation results of the coexpressed genes, as obtained from gene ontology, establish the significance of the introduced measure.
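The remark that the correlation coefficient captures linear but not nonlinear dependence can be illustrated with a minimal sketch. The profiles below are synthetic and the helper name `pearson` is an assumption made for illustration; none of this is taken from the paper's data or code.

```python
import numpy as np


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.corrcoef(x, y)[0, 1])


# Synthetic "expression profiles" over 21 symmetric sample points (illustrative only).
t = np.linspace(-1.0, 1.0, 21)
linear_partner = 2.0 * t + 1.0   # exact linear function of t
nonlinear_partner = t ** 2       # exact, but nonlinear, function of t

r_linear = pearson(t, linear_partner)
r_nonlinear = pearson(t, nonlinear_partner)

print(f"linear dependence:    r = {r_linear:.3f}")
print(f"nonlinear dependence: r = {r_nonlinear:.3f}")
```

Both partner profiles are fully determined by `t`, yet only the linear one yields a correlation close to 1; the quadratic one yields a value close to 0 because the points are symmetric about zero.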
By further extending the BioSim measure, we show that one can effectively identify the variability in expression patterns over multiple phenotypes. We have also extended BioSim to identify pairwise differential expression patterns and coexpression dynamics. The significance of these studies is shown through analysis of several real-life data sets. Because the measure is computed stepwise over time points, it is also effective at identifying partially coexpressed genes. On the whole, we put forward a complete framework for coexpression analysis based on the BioSim measure.
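The abstract does not give the BioSim formula, so the following is only a toy sketch of the general idea of a stepwise, angle-based comparison, and it should not be read as the published measure. The function names, the use of `arctan` on consecutive differences, the unit spacing between time points, and the rescaling to the interval (-1, 1] are all assumptions made for illustration.

```python
import numpy as np


def stepwise_angle_scores(x, y):
    """Toy per-step agreement score in (-1, 1]; NOT the published BioSim formula.

    Assumes unit spacing between time points. Each step's slope is turned into
    an angle in (-pi/2, pi/2), and the two angles are compared.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.ndim != 1 or x.shape != y.shape or x.size < 2:
        raise ValueError("profiles must be 1-D, equal length, with >= 2 points")
    theta_x = np.arctan(np.diff(x))
    theta_y = np.arctan(np.diff(y))
    # Identical step angles give 1; the score falls as the angles diverge.
    return 1.0 - 2.0 * np.abs(theta_x - theta_y) / np.pi


def toy_angular_similarity(x, y):
    """Aggregate the per-step scores by a simple mean (an assumed choice)."""
    return float(np.mean(stepwise_angle_scores(x, y)))


# Synthetic profiles: the genes move together for three steps, then diverge.
gene_a = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
gene_b = np.array([1.5, 2.5, 3.5, 4.5, 3.5, 2.5, 1.5])

scores = stepwise_angle_scores(gene_a, gene_b)
overall = toy_angular_similarity(gene_a, gene_b)

print("per-step scores:", np.round(scores, 3))
print("aggregate score:", round(overall, 3))
```

In this sketch the per-step scores are high over the first three steps and drop afterwards, which is the kind of information a single whole-profile statistic hides and which a stepwise computation could use to flag partially coexpressed genes.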