A cross-species bi-clustering approach to identifying conserved co-regulated genes.

Authors

Sun Jiangwen, Jiang Zongliang, Tian Xiuchun, Bi Jinbo

Affiliations

Department of Computer Science and Engineering.

Center for Regenerative Biology and Department of Animal Science, University of Connecticut, Storrs, CT 06269, USA.

Publication information

Bioinformatics. 2016 Jun 15;32(12):i137-i146. doi: 10.1093/bioinformatics/btw278.

DOI:10.1093/bioinformatics/btw278

PMID:27307610

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC4908362/

Abstract

MOTIVATION

A growing number of studies have explored pre-implantation embryonic development in multiple mammalian species. However, the conservation and variation among species in their developmental programming remain poorly defined because effective computational methods for detecting co-regulated genes that are conserved across species are lacking. The most sophisticated method to date for identifying conserved co-regulated genes is a two-step approach: it first identifies gene clusters for each species by a cluster analysis of gene expression data, and then computes the overlaps of clusters identified from different species to reveal common subgroups. This approach deals poorly with the noise that the complicated procedures for quantifying gene expression introduce into the expression data. Furthermore, because the approach is sequential, the gene clusters identified in the first step may overlap little among species in the second step, which makes conserved co-regulated genes difficult to detect.

RESULTS

We propose a cross-species bi-clustering approach that first denoises the gene expression data of each species into a data matrix. The rows of the data matrices of different species represent the same set of genes, and the columns represent the developmental stages of each species, so each gene is characterized by its expression pattern over those stages. A novel bi-clustering method is then developed to cluster genes into subgroups by a joint sparse rank-one factorization of all the data matrices. This method decomposes each data matrix into a product of a column vector and a row vector: the column vector is an indicator that is consistent across the matrices (species) and identifies the same gene cluster, and the row vector specifies, for each species, the developmental stages in which the clustered genes are co-regulated. An efficient optimization algorithm has been developed, together with a convergence analysis. The approach was first validated on synthetic data and compared with the two-step method and several recent joint clustering methods. We then applied it to two real-world datasets of gene expression during the pre-implantation embryonic development of the human and mouse. Co-regulated genes consistent between the human and mouse were identified, offering insights into conserved functions, as well as similarities and differences in genome activation timing between human and mouse embryos.
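The idea of a shared column vector with species-specific row vectors can be illustrated with a small sketch. The abstract does not state the objective function or the update rules, so the code below is not the authors' algorithm and does not use the mvbc package: it is a generic alternating soft-thresholding heuristic, assumed here only for illustration. The function names (`joint_sparse_rank_one`, `make_toy_data`), the penalty parameters `lam_u` and `lam_v`, and the toy data are all hypothetical.

```python
import numpy as np


def soft_threshold(x, lam):
    """Shrink every entry of x toward zero by lam (entries below lam become 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)


def joint_sparse_rank_one(matrices, lam_u=5.0, lam_v=0.5, n_iter=100, tol=1e-8):
    """Illustrative heuristic, NOT the published method.

    matrices: list of arrays, one per species, each of shape
              (n_genes, n_stages_k); rows are the same genes in every array.
    Returns a unit-norm sparse gene vector u shared by all species and a
    list of sparse stage vectors v_k, so that X_k is roughly u v_k^T.
    """
    # Start from the leading left singular vector of the concatenated data.
    u = np.linalg.svd(np.hstack(matrices), full_matrices=False)[0][:, 0]
    for _ in range(n_iter):
        # With u fixed, each species gets its own sparse stage vector.
        vs = [soft_threshold(X.T @ u, lam_v) for X in matrices]
        # With the v_k fixed, pool the evidence from all species into one u.
        z = soft_threshold(sum(X @ v for X, v in zip(matrices, vs)), lam_u)
        norm = np.linalg.norm(z)
        if norm == 0.0:
            # Penalty too strong for this data: no gene survives.
            return np.zeros_like(u), [np.zeros(X.shape[1]) for X in matrices]
        u_new = z / norm
        converged = np.linalg.norm(u_new - u) < tol
        u = u_new
        if converged:
            break
    # Recompute the stage vectors so they match the final u.
    vs = [soft_threshold(X.T @ u, lam_v) for X in matrices]
    return u, vs


def make_toy_data(seed=0):
    """Hypothetical toy data: genes 0-9 are co-regulated in both species."""
    rng = np.random.default_rng(seed)
    indicator = np.zeros(50)
    indicator[:10] = 1.0
    human_pattern = np.array([0, 0, 2, 2, 0, 0], dtype=float)  # active at stages 2, 3
    mouse_pattern = np.array([0, 3, 3, 0, 0], dtype=float)     # active at stages 1, 2
    human = np.outer(indicator, human_pattern) + rng.normal(scale=0.05, size=(50, 6))
    mouse = np.outer(indicator, mouse_pattern) + rng.normal(scale=0.05, size=(50, 5))
    return [human, mouse]
```

In this sketch the nonzero entries of `u` mark the genes placed in the cluster, and the nonzero entries of each `v_k` mark the stages of species `k` in which those genes move together. The sign of `u` is arbitrary, as in any rank-one factorization, so only the support (which entries are nonzero) should be interpreted. The penalty values are tuned to the toy data and would need to be chosen differently for real expression matrices.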

AVAILABILITY AND IMPLEMENTATION

The R package containing the implementation of the proposed method in C++ is available at https://github.com/JavonSun/mvbc.git and also at the R platform https://www.r-project.org/

CONTACT

jinbo@engr.uconn.edu
