Sun Jiangwen, Jiang Zongliang, Tian Xiuchun, Bi Jinbo
Department of Computer Science and Engineering.
Center for Regenerative Biology and Department of Animal Science, University of Connecticut, Storrs, CT 06269, USA.
Bioinformatics. 2016 Jun 15;32(12):i137-i146. doi: 10.1093/bioinformatics/btw278.
A growing number of studies have explored the process of pre-implantation embryonic development in multiple mammalian species. However, the conservation and variation among species in their developmental programming remain poorly defined, owing to the lack of effective computational methods for detecting co-regulated genes that are conserved across species. The most sophisticated method to date for identifying conserved co-regulated genes is a two-step approach. This approach first identifies gene clusters for each species by a cluster analysis of gene expression data, and then computes the overlaps of clusters identified from different species to reveal common subgroups. It is ineffective at dealing with the noise in the expression data introduced by the complicated procedures used to quantify gene expression. Furthermore, because the approach is sequential, the gene clusters identified for different species in the first step may have little overlap in the second step, which makes conserved co-regulated genes difficult to detect.
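The second step of the two-step approach described above can be sketched as a pairwise intersection of per-species clusters. This is only an illustration under stated assumptions: the function name `cluster_overlaps`, the cluster labels and the gene names are hypothetical, and the first step (the per-species cluster analysis) is assumed to have already produced the label-to-gene-set mappings.

```python
def cluster_overlaps(clusters_a, clusters_b):
    """Return the non-empty intersections between two sets of clusters.

    Each argument maps a cluster label to a set of gene names
    (hypothetical output of a per-species cluster analysis).
    """
    overlaps = {}
    for label_a, genes_a in clusters_a.items():
        for label_b, genes_b in clusters_b.items():
            shared = genes_a & genes_b
            if shared:
                overlaps[(label_a, label_b)] = shared
    return overlaps


# Hypothetical clusters found independently for each species.
human_clusters = {"H1": {"g1", "g2", "g3"}, "H2": {"g4", "g5"}}
mouse_clusters = {"M1": {"g1", "g4"}, "M2": {"g2", "g5", "g6"}}

overlaps = cluster_overlaps(human_clusters, mouse_clusters)
for pair, genes in sorted(overlaps.items()):
    print(pair, sorted(genes))
```

In this toy input every overlap contains a single gene, which illustrates the weakness noted above: clusters found independently per species need not align, so the common subgroups can be fragmented.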
We propose a cross-species bi-clustering approach that first denoises the gene expression data of each species into a data matrix. The rows of the data matrices of different species represent the same set of genes, and the columns represent the developmental stages of each species, so that each gene is characterized by its expression pattern over those stages. A novel bi-clustering method is then developed to cluster genes into subgroups by a joint sparse rank-one factorization of all the data matrices. This method decomposes each data matrix into the product of a column vector and a row vector, where the column vector is an indicator that is consistent across the matrices (species) and identifies the same gene cluster, and the row vector specifies, for each species, the developmental stages at which the clustered genes are co-regulated. An efficient optimization algorithm has been developed, together with a convergence analysis. The approach was first validated on synthetic data and compared with the two-step method and several recent joint clustering methods. We then applied it to two real-world datasets of gene expression during the pre-implantation embryonic development of the human and the mouse. Co-regulated genes consistent between the human and the mouse were identified, offering insights into conserved functions, as well as similarities and differences in genome activation timing between human and mouse embryos.
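The idea of a joint sparse rank-one factorization can be sketched with a simple alternating scheme. This is not the authors' algorithm or the API of the `mvbc` package: the soft-thresholding updates, the unit-norm constraint on the shared vector, the penalty values `lam_u` and `lam_v`, and the toy matrices are all assumptions made for illustration. A single gene vector `u` is shared by all matrices, and each species gets its own stage vector `v`.

```python
import math


def soft_threshold(values, lam):
    """Shrink each value toward zero by lam; small entries become zero."""
    return [math.copysign(max(abs(x) - lam, 0.0), x) for x in values]


def joint_sparse_rank_one(matrices, lam_u=1.0, lam_v=0.1, n_iter=50):
    """Approximate every matrix X_k by u * v_k^T with one shared sparse u.

    matrices: list of row-major matrices (lists of lists) whose rows are
    the same genes; columns are the stages of each species.
    Returns (u, vs) with u of unit length (or all zeros if fully shrunk).
    """
    n_genes = len(matrices[0])
    u = [1.0 / math.sqrt(n_genes)] * n_genes
    vs = [[0.0] * len(X[0]) for X in matrices]
    for _ in range(n_iter):
        # Per-species update: v_k = soft_threshold(X_k^T u).
        vs = [
            soft_threshold(
                [sum(X[i][j] * u[i] for i in range(n_genes))
                 for j in range(len(X[0]))],
                lam_v,
            )
            for X in matrices
        ]
        # Shared update: u is proportional to soft_threshold(sum_k X_k v_k).
        z = [
            sum(X[i][j] * v[j]
                for X, v in zip(matrices, vs)
                for j in range(len(v)))
            for i in range(n_genes)
        ]
        z = soft_threshold(z, lam_u)
        norm = math.sqrt(sum(x * x for x in z))
        if norm == 0.0:
            # Penalties removed everything; report an empty bi-cluster.
            return [0.0] * n_genes, [[0.0] * len(X[0]) for X in matrices]
        u = [x / norm for x in z]
    return u, vs


# Toy data: genes 0-2 are active together in both species, genes 3-5 are noise.
human = [[2.0, 2.0, 0.0, 0.0]] * 3 + [[0.1, 0.0, 0.0, 0.1]] * 3
mouse = [[0.0, 3.0, 3.0]] * 3 + [[0.0, 0.1, 0.0]] * 3

u, vs = joint_sparse_rank_one([human, mouse])
cluster_genes = [i for i, x in enumerate(u) if x != 0.0]
human_stages = [j for j, x in enumerate(vs[0]) if x != 0.0]
mouse_stages = [j for j, x in enumerate(vs[1]) if x != 0.0]
print(cluster_genes, human_stages, mouse_stages)
```

The nonzero entries of `u` mark the genes of the shared cluster, while the nonzero entries of each `v` mark the stages at which those genes are active in that species. Because `u` is estimated from all matrices at once, the gene cluster is consistent across species by construction, unlike in the two-step approach.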
The R package containing the C++ implementation of the proposed method is available at https://github.com/JavonSun/mvbc.git and also through the R platform at https://www.r-project.org/.