GMHCC: high-throughput analysis of biomolecular data using graph-based multiple hierarchical consensus clustering.

Affiliations

School of Artificial Intelligence, Jilin University, Changchun 130012, China.

Department of Computer Science, City University of Hong Kong, Hong Kong 999077, Hong Kong SAR.

Publication information

Bioinformatics. 2022 May 26;38(11):3020-3028. doi: 10.1093/bioinformatics/btac290.

Abstract

MOTIVATION

Thanks to the development of high-throughput sequencing technologies, massive amounts of diverse biomolecular data have accumulated, revolutionizing the study of genomics and molecular biology. One of the main challenges in analyzing these data is clustering the samples into subtypes or subpopulations to facilitate downstream analysis. Many clustering methods have recently been developed for biomolecular data. However, these computational methods are often limited by the high dimensionality, heterogeneity and noise of the data.

RESULTS

In this study, we develop a novel Graph-based Multiple Hierarchical Consensus Clustering (GMHCC) method, which combines unsupervised graph-based feature ranking (FR) with a graph-based linking method to explore the multiple hierarchical information of the partitions underlying the consensus clustering, for multiple types of biomolecular data. First, we propose a graph-based unsupervised FR model that scores each feature by building a graph over pairwise features and then assigns each feature a rank. Second, to maintain the diversity and robustness of the basic partitions (BPs), we construct multiple diverse feature subsets to generate several BPs and then explore the hierarchical structures of the multiple BPs by refining the global consensus function. Finally, we develop a new graph-based linking method, which explicitly considers the relationships between clusters, to generate the final partition. Experiments on multiple types of biomolecular data, including 35 cancer gene expression datasets and eight single-cell RNA-seq datasets, validate the effectiveness of our method over several state-of-the-art consensus clustering approaches. Furthermore, differential gene analysis, gene ontology enrichment analysis and KEGG pathway analysis are conducted, providing novel insights into cell developmental lineages and characterization mechanisms.
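
To make the general idea of combining several basic partitions concrete, the following is a minimal sketch of a generic co-association (evidence accumulation) consensus, not the GMHCC algorithm itself. It assumes the BPs are already given as label lists, and it replaces the paper's refined consensus function and graph-based linking with a simple stand-in: samples are joined when they co-cluster in more than a chosen fraction of BPs, and the connected components of that graph become the final clusters. The function names and the `threshold` parameter are hypothetical and chosen for illustration only.

```python
from itertools import combinations


def co_association(partitions):
    """Return an n x n matrix whose (i, j) entry is the fraction of
    basic partitions in which samples i and j share a cluster."""
    n = len(partitions[0])
    m = len(partitions)
    counts = [[0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / m for c in row] for row in counts]


def consensus_labels(partitions, threshold=0.5):
    """Stand-in consensus step (an assumption, not the paper's method):
    link samples whose co-association exceeds `threshold`, then label
    the connected components of the resulting graph."""
    co = co_association(partitions)
    n = len(co)
    parent = list(range(n))  # union-find forest over samples

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in combinations(range(n), 2):
        if co[i][j] > threshold:
            parent[find(i)] = find(j)

    # Relabel components as 0, 1, ... in order of first appearance.
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]


# Three toy basic partitions of six samples (illustrative data only).
basic_partitions = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],
]
final_partition = consensus_labels(basic_partitions)
print(final_partition)
```

In this toy input the third sample is assigned inconsistently by one BP, but it still co-clusters with the first two samples in two of the three BPs, so the majority threshold keeps it in their group. In the method described above, the BPs would instead come from clustering on diverse subsets of ranked features.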

AVAILABILITY AND IMPLEMENTATION

The source code is available at GitHub: https://github.com/yifuLu/GMHCC. The software and the supporting data can be downloaded from: https://figshare.com/articles/software/GMHCC/17111291.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.
