College of Computer Science and Electronic Engineering, Hunan University, Changsha, Hunan, 410082, China.
College of Life Science, Northeast Forestry University, Harbin, Heilongjiang, 150000, China.
Comput Biol Med. 2022 Jul;146:105697. doi: 10.1016/j.compbiomed.2022.105697. Epub 2022 Jun 8.
Recent advances in single-cell RNA sequencing (scRNA-seq) provide exciting opportunities for transcriptome analysis at single-cell resolution. Clustering individual cells is a key step in scRNA-seq analysis for revealing cell subtypes and inferring cell lineage. Although many dedicated algorithms have been proposed, clustering quality remains a computational challenge for scRNA-seq data, a problem exacerbated by the inflated zero counts (dropouts) that arise from various sources of technical noise. To address this challenge, we assess combinations of nine popular dropout imputation methods and eight clustering methods on a collection of 10 well-annotated scRNA-seq datasets with different sample sizes. Our results show that (i) imputation algorithms typically improve both the performance of clustering methods and the quality of data visualization with t-distributed stochastic neighbor embedding (t-SNE); and (ii) the performance of a particular combination of imputation and clustering methods varies with dataset size. For example, pairing single-cell analysis via expression recovery (SAVER) with sparse subspace clustering (SSC) usually works well on smaller datasets, while pairing adaptively-thresholded low-rank approximation (ALRA) with single-cell interpretation via multikernel learning (SIMLR) usually achieves the best performance on larger datasets.
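The evaluation design described in the abstract, scoring every imputation and clustering pair against known cell labels, can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the data are synthetic Poisson counts with random zero inflation, `low_rank_imputation` is a crude rank-k SVD reconstruction standing in for a real imputation method (it is not ALRA or SAVER), k-means and hierarchical clustering stand in for the benchmarked clustering methods, and the adjusted Rand index is used as the score because the abstract does not name its metric.

```python
import itertools

import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics import adjusted_rand_score


def simulate_counts(n_cells=300, n_genes=200, n_types=3, dropout=0.5, seed=0):
    """Synthetic count matrix with known cell types and random zero inflation."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, n_types, size=n_cells)
    type_means = rng.gamma(shape=2.0, scale=5.0, size=(n_types, n_genes))
    counts = rng.poisson(type_means[labels])
    keep = rng.random(counts.shape) > dropout  # False marks a simulated dropout
    return counts * keep, labels


def no_imputation(x):
    """Baseline: pass the matrix through unchanged."""
    return x


def low_rank_imputation(x, rank=5):
    """Hypothetical stand-in imputer: rank-k SVD reconstruction, clipped at zero."""
    svd = TruncatedSVD(n_components=rank, random_state=0)
    reconstructed = svd.inverse_transform(svd.fit_transform(x))
    return np.clip(reconstructed, 0.0, None)


def benchmark(counts, labels, n_clusters):
    """Score every imputation x clustering pair against the known labels."""
    imputers = {"none": no_imputation, "low_rank": low_rank_imputation}
    clusterers = {
        "kmeans": lambda: KMeans(n_clusters=n_clusters, n_init=10, random_state=0),
        "hierarchical": lambda: AgglomerativeClustering(n_clusters=n_clusters),
    }
    log_counts = np.log1p(counts.astype(float))  # common variance-stabilizing step
    scores = {}
    for (imp_name, impute), (clu_name, make) in itertools.product(
        imputers.items(), clusterers.items()
    ):
        predicted = make().fit_predict(impute(log_counts))
        scores[(imp_name, clu_name)] = adjusted_rand_score(labels, predicted)
    return scores


if __name__ == "__main__":
    counts, labels = simulate_counts()
    for pair, ari in sorted(benchmark(counts, labels, n_clusters=3).items()):
        print(pair, round(ari, 3))
```

Repeating the same grid with different values of `n_cells` gives a rough way to see whether the ranking of pairs shifts with dataset size, which is the kind of dependence the abstract reports for the real methods.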