SUMA: a lightweight machine learning model-powered shared nearest neighbour-based clustering application interface for scRNA-Seq data.

Authors

Karakurt Hamza Umut, Pir Pınar

Affiliations

Department of Bioengineering, Faculty of Engineering, Gebze Technical University, Kocaeli, Turkiye.

Idea Technology Solutions R&D Center, İstanbul, Turkiye.

Publication information

Turk J Biol. 2023 Dec 18;47(6):413-422. doi: 10.55730/1300-0152.2675. eCollection 2023.

DOI: 10.55730/1300-0152.2675

PMID: 38681777

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC11045205/

Abstract

BACKGROUND/AIM

Single-cell transcriptomics (scRNA-Seq) explores cellular diversity at the gene expression level. Because scRNA-Seq data are inherently sparse and noisy, and the types of the sequenced cells are uncertain, effective clustering and cell type annotation are essential. Graph-based clustering of scRNA-Seq data is a simple yet powerful approach that represents the data as a "shared nearest neighbour" graph and clusters the cells using graph clustering algorithms. These algorithms depend on several user-defined parameters. Here we present SUMA, a lightweight tool that uses a random forest model to predict the number of neighbours that yields the best clustering results. Moreover, we integrated our method with other commonly used methods in an RShiny application. SUMA can be used in a local environment (https://github.com/hkarakurt8742/SUMA) or as a browser tool (https://hkarakurt.shinyapps.io/suma/).
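To make the role of the number of neighbours concrete, the following is a minimal sketch of a shared nearest neighbour graph in pure Python. It is not SUMA's implementation. The function names, the toy coordinates, the brute-force Euclidean search, the choice to count each cell as a member of its own neighbourhood, and the Jaccard edge weight are all assumptions made for illustration.

```python
import math


def knn_sets(points, k):
    """For each point, return the set holding itself and its k nearest neighbours.

    Assumption: brute-force Euclidean distance; real tools use faster searches.
    """
    neighbourhoods = []
    for i, p in enumerate(points):
        # Sort all other points by distance (ties broken by index).
        ranked = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        neighbourhoods.append({i} | {j for _, j in ranked[:k]})
    return neighbourhoods


def snn_graph(points, k):
    """Return {(i, j): weight} with i < j, weighted by neighbourhood overlap.

    Assumption: the weight is the Jaccard index of the two neighbourhoods,
    and pairs that share no neighbour get no edge.
    """
    hoods = knn_sets(points, k)
    edges = {}
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            shared = len(hoods[i] & hoods[j])
            if shared:
                edges[(i, j)] = shared / len(hoods[i] | hoods[j])
    return edges


# Hypothetical toy data: two well-separated groups of three "cells" each.
cells = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
small_k = snn_graph(cells, k=2)  # edges only inside each group
large_k = snn_graph(cells, k=5)  # every cell neighbours every other cell
```

With `k=2` the graph keeps the two groups disconnected, whereas with `k=5` every pair of cells is linked and the group structure is no longer visible in the graph. This is the kind of sensitivity to the neighbour count that motivates predicting the parameter instead of fixing it by hand.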

MATERIALS AND METHODS

Publicly available scRNA-Seq datasets and 3 different graph-based clustering algorithms were used to develop SUMA, and a wide range of values for the number of neighbours and the number of variant genes was taken into consideration. Clustering quality was assessed using the adjusted Rand index (ARI), computed against the true labels of each dataset. The data were split into training and test datasets, and the model was built and optimised using the Scikit-learn (Python) and randomForest (R) libraries.
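For readers unfamiliar with the adjusted Rand index, the sketch below computes it from the standard contingency-table formula using only the Python standard library. The function name and the toy labels are illustrative assumptions, and the handling of degenerate inputs (fewer than two cells, or two trivial partitions) is a convention chosen here rather than something stated in the paper.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(true_labels, pred_labels):
    """Adjusted Rand index between two labelings of the same cells."""
    if len(true_labels) != len(pred_labels):
        raise ValueError("label lists must have the same length")
    n = len(true_labels)
    if n < 2:
        return 1.0  # assumption: no pairs to disagree on

    # Pairs of cells placed together in both, in the true, and in the predicted labeling.
    together_both = sum(comb(c, 2) for c in Counter(zip(true_labels, pred_labels)).values())
    together_true = sum(comb(c, 2) for c in Counter(true_labels).values())
    together_pred = sum(comb(c, 2) for c in Counter(pred_labels).values())

    expected = together_true * together_pred / comb(n, 2)  # agreement expected by chance
    maximum = (together_true + together_pred) / 2
    if maximum == expected:
        return 1.0  # assumption: both partitions are trivial and identical
    return (together_both - expected) / (maximum - expected)


# Hypothetical toy labels for four cells.
truth = ["T", "T", "B", "B"]
same_up_to_renaming = adjusted_rand_index(truth, [1, 1, 0, 0])
crossed = adjusted_rand_index(truth, [0, 1, 0, 1])
```

The index depends only on which cells are grouped together, so renaming the clusters leaves it at 1.0, while a clustering that cuts across both true groups scores below zero, that is, worse than chance.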

RESULTS

The accuracy of our machine learning model was 0.96, while the AUC of the ROC curve was 0.98. The model indicated that the number of cells in scRNA-Seq data is the most important feature when deciding the number of neighbours.
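The metrics reported above can be computed with Scikit-learn along the following lines. This is a hedged sketch on synthetic data, not the authors' pipeline: the binary target, the generated feature matrix, the placeholder feature names, and every hyperparameter are assumptions, so the numbers it produces say nothing about SUMA itself.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Assumption: synthetic stand-in data with a binary target.
X, y = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=0, random_state=0
)
# Hypothetical placeholder names; the paper's actual feature set is not listed here.
feature_names = ["feature_0", "feature_1", "feature_2", "feature_3"]

# Hold out a stratified test set, as in a train/test evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# Accuracy uses hard predictions; ROC AUC uses the predicted class-1 probability.
accuracy = accuracy_score(y_test, model.predict(X_test))
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

# Impurity-based importances, one per feature, summing to 1.
importances = dict(zip(feature_names, model.feature_importances_))
most_important = max(importances, key=importances.get)
```

Ranking `importances` is one common way to arrive at a statement such as "the number of cells is the most important feature", although impurity-based importances are only one of several possible importance measures.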

CONCLUSION

We developed and evaluated the SUMA model and implemented the method in the SUMAShiny app, which integrates SUMA with different clustering methods and enables users who are not bioinformaticians to cluster and visualise their scRNA-Seq data easily. The SUMAShiny app is available for both desktop and browser use.
