Parallel clustering of single cell transcriptomic data with split-merge sampling on Dirichlet process mixtures.

Author information

Department of Computer Science, University of California, Irvine, CA, USA.

SysBioLab, Centre for Biomedical Research (CBMR), University of Algarve, Faro, Algarve, Portugal.

Publication information

Bioinformatics. 2019 Mar 15;35(6):953-961. doi: 10.1093/bioinformatics/bty702.

Abstract

MOTIVATION

With the development of droplet-based systems, massive single-cell transcriptome datasets have become available. They enable analysis of cellular and molecular processes at single-cell resolution, which is instrumental to understanding many biological processes. Although state-of-the-art clustering methods have been applied to these data, they face three challenges: (i) clustering quality still needs to be improved; (ii) most models require prior knowledge of the number of clusters, which is not always available; and (iii) faster computation is needed.

RESULTS

We propose to tackle these challenges with Parallelized Split Merge Sampling on Dirichlet Process Mixture Model (the Para-DPMM model). Unlike classic DPMM methods, which sample the assignment of one data point at a time, the split-merge mechanism samples at the cluster level, which significantly improves convergence and the optimality of the result. The model is highly parallelized and can use the computing power of high-performance computing (HPC) clusters, enabling inference on very large datasets. Experimental results show that the model outperforms widely used current models in both clustering quality and computational speed.
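To make the idea of cluster-level sampling concrete, here is a minimal sketch of a single split proposal and its Metropolis-Hastings acceptance ratio. It is not the authors' implementation: it assumes the simple random split-merge scheme in the style of Jain and Neal, a Dirichlet-multinomial likelihood over gene counts, and a symmetric Dirichlet prior. The function names and the parameters `alpha` (DP concentration) and `beta` (Dirichlet pseudo-count) are hypothetical, and parallelization is left out.

```python
import math
import numpy as np

def log_marginal(counts, beta):
    """Log marginal likelihood of one cluster under a Dirichlet-multinomial
    model (assumed here), dropping per-cell multinomial coefficients, which
    cancel in any ratio that compares partitions of the same cells."""
    gene_totals = counts.sum(axis=0)          # pooled counts per gene
    n_genes = len(gene_totals)
    log_m = math.lgamma(n_genes * beta) - math.lgamma(n_genes * beta + gene_totals.sum())
    for n in gene_totals:
        log_m += math.lgamma(beta + n) - math.lgamma(beta)
    return log_m

def random_split(members, i, j, rng):
    """Propose a split: anchors i and j seed the two sides, and every other
    member joins either side with probability 1/2."""
    side_a, side_b = [i], [j]
    for k in members:
        if k not in (i, j):
            (side_a if rng.random() < 0.5 else side_b).append(k)
    return side_a, side_b

def log_split_ratio(X, idx_a, idx_b, alpha, beta):
    """Log Metropolis-Hastings ratio for splitting one cluster into idx_a and
    idx_b. The matching merge move uses the negative of this value."""
    n_a, n_b = len(idx_a), len(idx_b)
    merged = list(idx_a) + list(idx_b)
    # Prior ratio under the Chinese restaurant process:
    # alpha * (n_a - 1)! * (n_b - 1)! / (n_a + n_b - 1)!
    log_prior = (math.log(alpha) + math.lgamma(n_a) + math.lgamma(n_b)
                 - math.lgamma(n_a + n_b))
    # Likelihood ratio: two separate clusters versus one pooled cluster.
    log_lik = (log_marginal(X[list(idx_a)], beta)
               + log_marginal(X[list(idx_b)], beta)
               - log_marginal(X[merged], beta))
    # Proposal ratio: the merge is deterministic, while this particular
    # split was drawn with probability (1/2)^(n_a + n_b - 2).
    log_proposal = (n_a + n_b - 2) * math.log(2.0)
    return log_prior + log_lik + log_proposal

# Toy data: four cells, two genes, two clearly different expression profiles.
X = np.array([[20, 0], [20, 0], [0, 20], [0, 20]])
log_r = log_split_ratio(X, [0, 1], [2, 3], alpha=1.0, beta=1.0)
accept_prob = math.exp(min(0.0, log_r))   # min(1, exp(log_r)) without overflow
```

Because one accepted move reassigns a whole group of cells at once, a sampler of this kind can escape configurations in which point-by-point Gibbs updates would stall. That is the intuition behind the convergence claim above, though the paper's actual proposal scheme may differ from this sketch.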

AVAILABILITY AND IMPLEMENTATION

Source code is publicly available at https://github.com/tiehangd/Para_DPMM/tree/master/Para_DPMM_package.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.
