Scellpam: an R package/C++ library to perform parallel partitioning around medoids on scRNAseq data sets.

Author information

Department of Informatics, ETSE, University of Valencia, Avda. de la Universidad, s/n, 46100, Burjassot, Valencia, Spain.

Department of Statistics and Operations Research, University of Valencia, Avda. Vicente Andres Estelles, 46100, Burjassot, Valencia, Spain.

Publication information

BMC Bioinformatics. 2023 Sep 14;24(1):342. doi: 10.1186/s12859-023-05471-1.

Abstract

BACKGROUND

Partitioning around medoids (PAM) is one of the most widely used and successful clustering methods in many fields. One of its key advantages is that it requires only a distance or dissimilarity between the individuals, and because the cluster centers are actual points in the data set, they can be taken as reliable representatives of their classes. However, its wider application is hampered by the large amount of memory needed to store the distance matrix (quadratic in the number of individuals), by the high computational cost of computing that matrix and, less importantly, by the cost of the clustering algorithm itself.
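To make the quadratic memory cost concrete, the following is a minimal sketch of the arithmetic. The function name, the choice of 4-byte versus 8-byte values, and the option of storing only the lower triangle (diagonal included) are illustrative assumptions, not details taken from the software described here.

```python
def distance_matrix_bytes(n, bytes_per_value=8, triangular=False):
    """Bytes needed to hold pairwise distances among n individuals.

    Hypothetical helper for illustration only.
    triangular=True counts the lower triangle including the diagonal,
    which is enough for a symmetric distance matrix.
    """
    entries = n * (n + 1) // 2 if triangular else n * n
    return entries * bytes_per_value


# Full matrix of 8-byte doubles for 100,000 individuals.
full_double = distance_matrix_bytes(100_000)

# Lower triangle of 4-byte floats for the same individuals.
tri_float = distance_matrix_bytes(100_000, bytes_per_value=4, triangular=True)

print(full_double / 1e9, "GB vs", tri_float / 1e9, "GB")
```

Under these assumptions, halving the value size and exploiting symmetry cuts the footprint to roughly a quarter, but the growth with the number of individuals remains quadratic.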

RESULTS

Therefore, new software has been developed to address these issues. This software, provided under the GPL license and usable as either an R package or a C++ library, calculates the distance matrix in parallel for different distances/dissimilarities ([Formula: see text], [Formula: see text], Pearson, cosine and weighted Euclidean) and also implements a parallel fast version of PAM (FASTPAM1) that can use any data type to reduce memory usage. Moreover, the parallel implementation uses all the cores available in modern computers, which greatly reduces the execution time. Besides its general application, the software is especially useful for processing data from single-cell experiments. It has been tested on problems including the clustering of single-cell experiments with up to 289,000 cells and the expression of about 29,000 genes per cell.
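The two stages described above (a pairwise dissimilarity matrix, then a medoid search on it) can be sketched on a toy example. This is a naive, serial, pure-Python illustration: it is not FASTPAM1, it is not parallel, and it does not use the scellpam or ppamlib API. The function names, the Manhattan distance, and the "first k points" initialisation are all assumptions made for the sketch.

```python
from itertools import product


def manhattan_matrix(points):
    """Full pairwise Manhattan distance matrix as a list of lists."""
    return [[sum(abs(a - b) for a, b in zip(p, q)) for q in points]
            for p in points]


def total_cost(dist, medoids):
    """Sum over all points of the distance to the nearest medoid."""
    return sum(min(row[m] for m in medoids) for row in dist)


def naive_pam(dist, k):
    """Naive swap-based k-medoids on a precomputed distance matrix.

    Illustrative only: starts from the first k points and accepts any
    medoid/non-medoid swap that lowers the total cost, until none does.
    """
    n = len(dist)
    medoids = list(range(k))
    best = total_cost(dist, medoids)
    improved = True
    while improved:
        improved = False
        for i, candidate in product(range(k), range(n)):
            if candidate in medoids:
                continue
            trial = medoids[:i] + [candidate] + medoids[i + 1:]
            cost = total_cost(dist, trial)
            if cost < best:  # strict decrease guarantees termination
                medoids, best, improved = trial, cost, True
    # Assign each point to the position of its nearest medoid.
    labels = [min(range(k), key=lambda j: row[medoids[j]]) for row in dist]
    return medoids, labels, best


# Two well-separated toy groups of three points each.
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
toy_dist = manhattan_matrix(toy_points)
toy_medoids, toy_labels, toy_cost = naive_pam(toy_dist, k=2)
```

Note that the search only ever reads the distance matrix, never the original coordinates, which is why PAM can work with any distance or dissimilarity. Every swap evaluation in this naive form rescans the whole matrix, which is the kind of cost that faster PAM variants and parallel implementations aim to reduce.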

CONCLUSIONS

Comparisons with other current packages in terms of execution time have been made. The method greatly outperforms the available R packages for distance matrix calculation and also improves on the packages that implement PAM itself. The software is available as an R package at https://CRAN.R-project.org/package=scellpam and as C++ libraries at https://github.com/JdMDE/jmatlib and https://github.com/JdMDE/ppamlib. The package is useful for single-cell RNA-seq studies, but it is also applicable in other contexts where clustering of large data sets is required.
