Effect of distance measures on confidences of t-SNE embeddings and its implications on clustering for scRNA-seq data.

Affiliation

Cognitive Sciences and Artificial Intelligence, Tilburg School of Humanities and Digital Sciences, Tilburg University, Warandelaan 2, 5037 AB, Tilburg, The Netherlands.

Publication information

Sci Rep. 2023 Apr 21;13(1):6567. doi: 10.1038/s41598-023-32966-x.

DOI:10.1038/s41598-023-32966-x

PMID:37085593

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC10121641/

Abstract

Arguably one of the most famous dimensionality reduction algorithms today is t-distributed stochastic neighbor embedding (t-SNE). Although it is widely used to visualize scRNA-seq data, it is prone to errors like any algorithm and may lead to inaccurate interpretations of the visualized data. A reasonable way to avoid misinterpretation is to quantify the reliability of the visualizations. The focus of this work is, first, to find the best possible way to predict sample-based confidence scores for t-SNE embeddings and, next, to use these confidence scores to improve clustering algorithms. We adopt a random forest (RF) regression algorithm that uses seven distance measures as features, so that sample-based confidence scores are obtained under a variety of distance measures. The best configuration is then used to assess the clustering improvement with K-means and Density-Based Spatial Clustering of Applications with Noise (DBSCAN), based on Adjusted Rand Index (ARI), Normalized Mutual Information (NMI), and accuracy (ACC) scores. The experimental results show that the choice of distance measure has a considerable effect on the precision of the confidence scores, and that clustering performance can be improved substantially if these confidence scores are incorporated before the clustering algorithm is applied. Our findings reveal the usefulness of these confidence scores in downstream analyses of scRNA-seq data.
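To make the evaluation step more concrete, the following is a minimal sketch of how the ARI compares a clustering against reference labels, and of how per-sample confidence scores could be used to select a subset of samples. It is not the authors' implementation. The abstract does not state how the confidence scores are incorporated, so the thresholding rule in `keep_confident`, the threshold value 0.5, and the toy labels and scores are all assumptions made purely for illustration.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two labelings of the same samples."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("labelings must have the same length")
    n = len(labels_true)
    if n < 2:
        # With fewer than two samples there are no pairs to compare.
        return 1.0

    # Contingency counts: samples sharing each (reference, predicted) label pair.
    pair_counts = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)

    sum_pairs = sum(comb(c, 2) for c in pair_counts.values())
    sum_true = sum(comb(c, 2) for c in true_counts.values())
    sum_pred = sum(comb(c, 2) for c in pred_counts.values())
    total_pairs = comb(n, 2)

    expected = sum_true * sum_pred / total_pairs  # agreement expected by chance
    maximum = (sum_true + sum_pred) / 2           # largest attainable agreement
    if maximum == expected:
        # Degenerate case: both labelings are the same trivial partition.
        return 1.0
    return (sum_pairs - expected) / (maximum - expected)


def keep_confident(items, confidences, threshold):
    """Keep items whose confidence is at least `threshold`.

    Hypothetical selection rule; the abstract does not specify one.
    """
    return [x for x, c in zip(items, confidences) if c >= threshold]


# Toy data, made up for illustration only.
reference = [0, 0, 0, 1, 1, 1]
predicted = [0, 0, 1, 1, 1, 0]
confidence = [0.9, 0.8, 0.2, 0.95, 0.7, 0.1]

ari_all = adjusted_rand_index(reference, predicted)

# Assumed threshold of 0.5: score only the samples considered reliable.
reference_kept = keep_confident(reference, confidence, 0.5)
predicted_kept = keep_confident(predicted, confidence, 0.5)
ari_kept = adjusted_rand_index(reference_kept, predicted_kept)

print(f"ARI on all samples:       {ari_all:.3f}")
print(f"ARI on confident samples: {ari_kept:.3f}")
```

In this toy case the two mislabeled samples are also the two low-confidence ones, so the ARI on the retained subset is higher than on the full set. In an actual pipeline the selection would presumably happen before K-means or DBSCAN is run, with clustering repeated on the retained cells, rather than by filtering an existing labeling as done here.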
