School of Mathematics and Statistics, The University of Sydney, Sydney, NSW 2006, Australia.
Department of Anaesthesia, The University of Sydney Northern Clinical School, The University of Sydney, Sydney, NSW 2006, Australia.
Brief Bioinform. 2019 Nov 27;20(6):2316-2326. doi: 10.1093/bib/bby076.
Advances in high-throughput sequencing of single-cell gene expression [single-cell RNA sequencing (scRNA-seq)] have enabled transcriptome profiling of individual cells from complex samples. A common goal in scRNA-seq data analysis is to discover and characterise cell types, typically through clustering methods. The quality of the clustering therefore plays a critical role in biological discovery. While numerous clustering algorithms have been proposed for scRNA-seq data, fundamentally they all rely on a similarity metric for categorising individual cells. Although several studies have compared the performance of various clustering algorithms for scRNA-seq data, there is currently no benchmark of different similarity metrics and their influence on scRNA-seq data clustering. Here, we compared a panel of similarity metrics for clustering a collection of annotated scRNA-seq datasets. Within each dataset, a stratified subsampling procedure was applied and an array of evaluation measures was employed to assess the similarity metrics. This produced a highly reliable and reproducible consensus on their performance. Overall, we found that correlation-based metrics (e.g. Pearson's correlation) outperformed distance-based metrics (e.g. Euclidean distance). To test whether correlation-based metrics can benefit recently published clustering techniques for scRNA-seq data, we modified a state-of-the-art kernel-based clustering algorithm (SIMLR) to use Pearson's correlation as its similarity measure and found a significant performance improvement over Euclidean distance on scRNA-seq data clustering. These findings demonstrate the importance of similarity metrics in clustering scRNA-seq data and highlight Pearson's correlation as a favourable choice. Further comparison of different scRNA-seq library preparation protocols suggests that they may also affect clustering performance.
Finally, the benchmarking framework is available at http://www.maths.usyd.edu.au/u/SMS/bioinformatics/software.html.
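To make the distinction between the two metric families concrete, the following minimal Python sketch contrasts a correlation-based dissimilarity (one minus Pearson's correlation) with Euclidean distance on toy data. It is not the benchmarking framework from the study, and it does not claim to explain the reported results; it only illustrates one well-known property, namely that Pearson's correlation is unchanged when a profile is rescaled, whereas Euclidean distance is not. The gene values, the labels `type_A` and `type_B`, the threefold scaling factor and the helper `nearest` are all hypothetical and chosen purely for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def pearson_distance(x, y):
    """Correlation-based dissimilarity: 0 when profiles are perfectly correlated."""
    return 1.0 - pearson(x, y)


def euclidean(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def nearest(query, references, dist):
    """Label of the reference profile closest to `query` under `dist`."""
    return min(references, key=lambda label: dist(query, references[label]))


# Hypothetical expression profiles over five genes (made-up numbers).
references = {
    "type_A": [10.0, 8.0, 1.0, 0.0, 2.0],
    "type_B": [22.0, 10.0, 12.0, 9.0, 14.0],
}

# A cell with the type_A pattern but three times the overall signal,
# an assumed stand-in for a difference in total counts between cells.
scaled_cell = [3.0 * value for value in references["type_A"]]

label_pearson = nearest(scaled_cell, references, pearson_distance)
label_euclid = nearest(scaled_cell, references, euclidean)

print("Pearson-based assignment:  ", label_pearson)
print("Euclidean-based assignment:", label_euclid)
```

In this contrived case the rescaled cell is assigned to `type_A` under the correlation-based dissimilarity and to `type_B` under Euclidean distance, because the latter responds to overall magnitude as well as to the expression pattern. Real scRNA-seq data are normally normalised before clustering, so the size of any such effect in practice is an empirical question of the kind the benchmark addresses.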