Enabling scalable single-cell transcriptomic analysis through distributed computing with Apache Spark.

Author information

Adil Asif, Bhattacharya Namrata, Khan Naveed Jeelani, Asger Mohammed

Affiliations

Department of Computer Sciences, Baba Ghulam Shah Badshah University, Rajouri, India.

Department of Pathology and Laboratory Medicine, School of Medicine, Indiana University Indianapolis, Indianapolis, IN, USA.

Publication information

Sci Rep. 2025 Jul 29;15(1):27713. doi: 10.1038/s41598-025-12897-5.

DOI:10.1038/s41598-025-12897-5

PMID:40731055

Original article link: https://pmc.ncbi.nlm.nih.gov/articles/PMC12307815/

Abstract

As the field of single-cell genomics continues to develop, large-scale scRNA-seq datasets have become more prevalent. Although these datasets offer tremendous potential for shedding light on the complex biology of individual cells, the sheer volume of data presents significant challenges for management and analysis. Recently, a new discipline known as "big single-cell data science" has emerged to address these challenges. Within this field, a variety of computational tools have been developed to facilitate the processing and interpretation of scRNA-seq data. However, several of these tools focus primarily on the analytical aspect and tend to overlook the growing deluge of data generated by scRNA-seq experiments. In this study, we address this challenge and present scSPARKL, a novel parallel analytical framework that leverages Apache Spark to enable the efficient analysis of single-cell transcriptomic data. scSPARKL is supported by a rich set of staged algorithms developed to optimize Apache Spark's work environment. The tool incorporates six key operations for dealing with single-cell Big Data: data reshaping, data preprocessing, cell/gene filtering, data normalization, dimensionality reduction, and clustering. By utilizing Spark's unlimited scalability, fault tolerance, and parallelism, the tool enables researchers to rapidly and accurately analyze scRNA-seq datasets of any size. We demonstrate the utility of our framework and algorithms through a series of experiments on real-world scRNA-seq data. Overall, our results suggest that scSPARKL is a powerful and flexible tool for the analysis of single-cell transcriptomic data, with broad applications across biology and medicine.
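To make the first stages of such a pipeline concrete, the following is a minimal, plain-Python sketch of reshaping, cell filtering, and normalization on a toy matrix. It is not scSPARKL's code and does not use the Spark API: the data, the function names, the `min_genes` threshold, and the choice of library-size normalization followed by a log transform are all assumptions made for illustration. The point it shows is that, once the matrix is in long `(cell, gene, count)` form, each step is a per-key aggregation, which is the kind of operation a distributed engine can partition across workers.

```python
from collections import defaultdict
import math

# Toy wide count matrix: rows are genes, columns are cells (hypothetical values).
wide = {
    "GeneA": {"cell1": 10, "cell2": 0, "cell3": 0},
    "GeneB": {"cell1": 0, "cell2": 5, "cell3": 0},
    "GeneC": {"cell1": 30, "cell2": 15, "cell3": 1},
}


def reshape_to_long(matrix):
    """Melt the wide matrix into (cell, gene, count) records, dropping zeros."""
    return [
        (cell, gene, count)
        for gene, row in matrix.items()
        for cell, count in row.items()
        if count > 0
    ]


def filter_cells(records, min_genes):
    """Keep cells expressing at least `min_genes` genes (assumed threshold)."""
    genes_per_cell = defaultdict(int)
    for cell, _gene, _count in records:
        genes_per_cell[cell] += 1
    return [r for r in records if genes_per_cell[r[0]] >= min_genes]


def normalize(records, scale=10_000):
    """Scale each cell to a common library size, then apply log1p (assumed method)."""
    totals = defaultdict(int)
    for cell, _gene, count in records:
        totals[cell] += count
    return [
        (cell, gene, math.log1p(count / totals[cell] * scale))
        for cell, gene, count in records
    ]


long_records = reshape_to_long(wide)
kept = filter_cells(long_records, min_genes=2)
normalized = normalize(kept)
```

In this toy input, `cell3` expresses a single gene and is dropped by the filter, while the remaining cells are normalized against their own total counts. The long format also avoids storing the zeros that dominate scRNA-seq matrices.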
