MRC Biostatistics Unit, University of Cambridge, Cambridge, UK.
Cambridge Institute of Therapeutic Immunology and Infectious Disease, University of Cambridge, Cambridge, UK.
BMC Bioinformatics. 2022 Jul 21;23(1):290. doi: 10.1186/s12859-022-04830-8.
Cluster analysis is an integral part of precision medicine and systems biology, used to define groups of patients or biomolecules. Consensus clustering is an ensemble approach that is widely used in these areas, which combines the output from multiple runs of a non-deterministic clustering algorithm. Here we consider the application of consensus clustering to a broad class of heuristic clustering algorithms that can be derived from Bayesian mixture models (and extensions thereof) by adopting an early stopping criterion when performing sampling-based inference for these models. While the resulting approach is non-Bayesian, it inherits the usual benefits of consensus clustering, particularly in terms of computational scalability and providing assessments of clustering stability/robustness.
In simulation studies, we show that our approach can successfully uncover the target clustering structure, while also exploring different plausible clusterings of the data. We show that, when a parallel computation environment is available, our approach offers significant reductions in runtime compared to performing sampling-based Bayesian inference for the underlying model, while retaining many of the practical benefits of the Bayesian approach, such as exploring different numbers of clusters. We propose a heuristic to decide upon ensemble size and the early stopping criterion, and then apply consensus clustering to a clustering algorithm derived from a Bayesian integrative clustering method. We use the resulting approach to perform an integrative analysis of three 'omics datasets for budding yeast and find clusters of co-expressed genes with shared regulatory proteins. We validate these clusters using data external to the analysis.
Our approach can be used as a wrapper for essentially any existing sampling-based Bayesian clustering implementation. It enables meaningful clustering analyses to be performed using such implementations even when computational Bayesian inference is not feasible, e.g. due to poor exploration of the target density (often as a result of increasing numbers of features) or a limited computational budget that does not allow sufficient samples to be drawn from a single chain. This enables researchers to straightforwardly extend the applicability of existing software to much larger datasets, including implementations of sophisticated models such as those that jointly model multiple datasets.
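The wrapper idea described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that each short chain contributes a single partition (here, the one it holds after a fixed number of iterations) and that the ensemble is summarised as a matrix of pairwise co-clustering proportions. The names `consensus_matrix`, `consensus_clustering`, `ensemble_size`, `n_iter` and `toy_chain` are hypothetical, and `toy_chain` is a k-means-style stand-in used only so the example runs; in practice it would be replaced by a call to an existing sampling-based Bayesian clustering implementation.

```python
import numpy as np


def consensus_matrix(partitions):
    """Proportion of ensemble members in which each pair of items shares a cluster."""
    partitions = np.asarray(partitions)  # shape: (ensemble_size, n_items)
    # co[w, i, j] is True when items i and j have the same label in run w
    co = partitions[:, :, None] == partitions[:, None, :]
    return co.mean(axis=0)


def consensus_clustering(run_chain, data, ensemble_size, n_iter, seed=0):
    """Run independent short chains and combine the partition each one returns.

    `run_chain(data, n_iter, rng)` is assumed to return one label per item after
    stopping early at `n_iter` iterations. The chains are independent, so this
    loop could be distributed across workers in a parallel environment.
    """
    seeds = np.random.SeedSequence(seed).spawn(ensemble_size)
    partitions = [run_chain(data, n_iter, np.random.default_rng(s)) for s in seeds]
    return consensus_matrix(partitions)


def toy_chain(data, n_iter, rng, k=2):
    """Hypothetical stand-in for a real sampler (NOT MCMC): random start,
    then `n_iter` nearest-centre updates."""
    n_items = data.shape[0]
    labels = rng.integers(0, k, size=n_items)
    for _ in range(n_iter):
        centres = np.array([
            data[labels == c].mean(axis=0) if np.any(labels == c)
            else data[rng.integers(n_items)]  # re-seed an empty cluster
            for c in range(k)
        ])
        dist = ((data[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
    return labels


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Two well-separated groups of five items each
    data = np.vstack([rng.normal(0, 0.1, (5, 2)), rng.normal(5, 0.1, (5, 2))])
    cm = consensus_clustering(toy_chain, data, ensemble_size=20, n_iter=3)
    print(np.round(cm, 2))
```

Entries of the resulting matrix near 0 or 1 indicate pairs whose co-clustering is stable across the ensemble, while intermediate values flag pairs on which the individual runs disagree.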