Division of Biostatistics, School of Public Health, University of California, Berkeley, CA, USA.
Center for Computational Biology, University of California, Berkeley, CA, USA.
BMC Bioinformatics. 2024 May 24;25(1):198. doi: 10.1186/s12859-024-05814-6.
Single-cell transcriptome sequencing (scRNA-Seq) has allowed new types of investigations at unprecedented levels of resolution. Among the primary goals of scRNA-Seq is the classification of cells into distinct types. Many approaches build on the existing clustering literature to develop tools specific to single-cell data. However, almost all of these methods rely on heuristics or user-supplied parameters to control the number of clusters. This choice affects both the resolution of the clusters within the original dataset and their replicability across datasets. While many recommendations exist, there is in general little assurance that any given set of parameters represents an optimal choice in the trade-off between cluster resolution and replicability. For instance, another set of parameters may yield more clusters that are also more replicable.
Here, we propose Dune, a new method for optimizing the trade-off between the resolution of the clusters and their replicability. Our method takes as input a set of clustering results (or partitions) on a single dataset and iteratively merges clusters within each partition in order to maximize the concordance between partitions. As demonstrated on multiple datasets from different platforms, Dune outperforms existing techniques, which rely on hierarchical merging to reduce the number of clusters, in terms of both the replicability of the resultant merged clusters and their concordance with ground truth. Dune is available as an R package on Bioconductor: https://www.bioconductor.org/packages/release/bioc/html/Dune.html
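The merging idea described above can be illustrated with a minimal Python sketch. This is not Dune's implementation or API. It assumes the adjusted Rand index (ARI) as the concordance measure, which the abstract does not specify, and it uses a simple greedy rule: at each step, apply the single within-partition merge that most increases the mean pairwise ARI, and stop when no merge helps. The function names `adjusted_rand_index`, `mean_pairwise_ari`, and `greedy_merge` are hypothetical.

```python
from itertools import combinations
from math import comb


def adjusted_rand_index(a, b):
    """Adjusted Rand index between two equal-length label lists."""
    total = comb(len(a), 2)
    if total == 0:
        return 1.0
    # Contingency table of co-occurring labels, plus its margins.
    cells, rows, cols = {}, {}, {}
    for x, y in zip(a, b):
        cells[(x, y)] = cells.get((x, y), 0) + 1
        rows[x] = rows.get(x, 0) + 1
        cols[y] = cols.get(y, 0) + 1
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())
    expected = sum_rows * sum_cols / total
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        # Degenerate case (e.g. both partitions have a single cluster).
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


def mean_pairwise_ari(partitions):
    """Average ARI over all pairs of partitions (needs at least two)."""
    pairs = list(combinations(partitions, 2))
    return sum(adjusted_rand_index(p, q) for p, q in pairs) / len(pairs)


def greedy_merge(partitions):
    """Greedily merge clusters within partitions to raise the mean ARI.

    Assumption: this greedy rule is an illustration only, not the
    procedure used by the Dune package.
    """
    partitions = [list(p) for p in partitions]
    current = mean_pairwise_ari(partitions)
    while True:
        best = None  # (score, partition index, kept label, absorbed label)
        for i, labels in enumerate(partitions):
            for keep, drop in combinations(sorted(set(labels)), 2):
                merged = [keep if lab == drop else lab for lab in labels]
                trial = partitions[:i] + [merged] + partitions[i + 1:]
                score = mean_pairwise_ari(trial)
                improves = score > current + 1e-12
                if improves and (best is None or score > best[0]):
                    best = (score, i, keep, drop)
        if best is None:
            return partitions, current
        current, i, keep, drop = best
        partitions[i] = [keep if lab == drop else lab for lab in partitions[i]]


if __name__ == "__main__":
    fine = [0, 0, 1, 1, 2, 2]    # over-split partition of six cells
    coarse = [0, 0, 0, 0, 1, 1]  # coarser partition of the same cells
    merged, score = greedy_merge([fine, coarse])
    print(merged, score)
```

In the toy input, merging clusters 0 and 1 of the finer partition makes it agree with the coarser one, so the loop stops after one merge. Every candidate merge recomputes all pairwise ARIs, which is acceptable for a sketch but would be slow on real datasets with many clusters and partitions.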
Cluster refinement by Dune helps improve the robustness of any clustering analysis and reduces the reliance on tuning parameters. This method provides an objective approach for borrowing information across multiple clusterings to generate replicable clusters most likely to represent common biological features across multiple datasets.