Department of Statistics, University of California, Los Angeles, CA 90095-1554, USA.
Department of Human Genetics, University of California, Los Angeles, CA 90095-7088, USA.
Bioinformatics. 2021 Jun 9;37(9):1225-1233. doi: 10.1093/bioinformatics/btaa741.
Gene clustering is a widely used technique that has enabled computational prediction of unknown gene functions within a species. However, it remains a challenge to refine gene function prediction by leveraging evolutionarily conserved genes in another species. This challenge calls for a new computational algorithm to identify gene co-clusters in two species, so that genes in each co-cluster exhibit similar expression levels in each species and strong conservation between the species.
Here, we develop the bipartite tight spectral clustering (BiTSC) algorithm, which identifies gene co-clusters in two species based on gene orthology information and gene expression data. BiTSC implements a novel formulation that encodes gene orthology as a bipartite network and gene expression data as node covariates. This formulation allows BiTSC to combine the advantages of multiple unsupervised learning techniques: kernel enhancement, bipartite spectral clustering, consensus clustering, tight clustering and hierarchical clustering. As a result, BiTSC is a flexible and robust algorithm capable of identifying informative gene co-clusters without forcing all genes into co-clusters. Another advantage of BiTSC is that it does not rely on any distributional assumptions. Beyond cross-species gene co-clustering, BiTSC is also widely applicable as a general algorithm for identifying tight node co-clusters in any bipartite network with node covariates. We demonstrate the accuracy and robustness of BiTSC through comprehensive simulation studies. In a real data example, we use BiTSC to identify conserved gene co-clusters of Drosophila melanogaster and Caenorhabditis elegans, and we perform a series of downstream analyses to both validate BiTSC and verify the biological significance of the identified co-clusters.
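To make the formulation concrete, the following is a minimal sketch of two of the listed ingredients only: kernel enhancement and bipartite spectral clustering. It is not the BiTSC implementation and does not use the BiTSC package's API. The function names, the enhancement rule `W = A + tau * K1 A K2`, the Gaussian kernel, the parameters `tau` and `sigma`, and the toy data are all assumptions made for illustration. The consensus, tight and hierarchical clustering steps are not shown, so this sketch assigns every node to a co-cluster, which BiTSC itself avoids.

```python
import numpy as np


def gaussian_kernel(Z, sigma=1.0):
    """Gaussian similarity between the rows of Z (one row per gene)."""
    Z = np.asarray(Z, dtype=float)
    sq = np.sum(Z ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (Z @ Z.T), 0.0)
    return np.exp(-d2 / (2.0 * sigma ** 2))


def kmeans(P, k, n_iter=50):
    """Plain Lloyd's k-means with deterministic farthest-point initialisation."""
    centers = [P[0]]
    for _ in range(1, k):
        d = np.min([np.sum((P - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(P[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dist = ((P[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = P[labels == j].mean(axis=0)
    return labels


def spectral_cocluster(A, X1, X2, k, tau=1.0, sigma=1.0):
    """Co-cluster both sides of a bipartite network with node covariates.

    A  : (m, n) orthology adjacency between species-1 and species-2 genes
    X1 : (m, p1) expression covariates for species 1
    X2 : (n, p2) expression covariates for species 2
    """
    A = np.asarray(A, dtype=float)
    m = A.shape[0]
    # Assumed enhancement rule: spread orthology edges to genes with
    # similar expression on each side. The result is still bipartite.
    W = A + tau * gaussian_kernel(X1, sigma) @ A @ gaussian_kernel(X2, sigma)
    d1 = np.maximum(W.sum(axis=1), 1e-12)
    d2 = np.maximum(W.sum(axis=0), 1e-12)
    Wn = W / np.sqrt(np.outer(d1, d2))  # degree-normalised bipartite weights
    U, _, Vt = np.linalg.svd(Wn, full_matrices=False)
    # Embed both sides with the top-k singular vectors, then row-normalise.
    P = np.vstack([U[:, :k], Vt[:k, :].T])
    P /= np.maximum(np.linalg.norm(P, axis=1, keepdims=True), 1e-12)
    labels = kmeans(P, k)
    return labels[:m], labels[m:]


# Toy data: two groups of three genes per species, with different numbers
# of expression conditions per species (2 versus 3).
X1 = np.array([[0, 0], [0.1, 0], [0, 0.1], [5, 5], [5.1, 5], [5, 5.1]])
X2 = np.array([[0, 0, 0], [0.1, 0, 0], [0, 0.1, 0],
               [5, 5, 5], [5.1, 5, 5], [5, 5.1, 5]])
A = np.eye(6)
A[2, 2] = 0.0  # gene 2 of each species has no ortholog in the other

labels1, labels2 = spectral_cocluster(A, X1, X2, k=2)
```

In this toy example the genes without an orthology edge are still placed with their expression-similar neighbours, because the assumed enhancement step gives them nonzero weights in `W`.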
The Python package BiTSC is open-access and available at https://github.com/edensunyidan/BiTSC.
Supplementary data are available at Bioinformatics online.