Pavan Kenny, Saunders Arpiar
bioRxiv. 2025 Mar 22:2024.11.02.621676. doi: 10.1101/2024.11.02.621676.
As single-cell genomics technologies continue to accelerate biological discovery, there is a growing need for software tools that analyze atlas-scale datasets with elegant syntax and minimal computational resources. Here we introduce AnnSQL, a Python package that constructs an AnnData-inspired database using the in-process DuckDB engine, enabling orders-of-magnitude performance improvements when parsing single-cell genomics datasets with the ease of SQL. We highlight AnnSQL functionality and demonstrate transformative runtime improvements by comparing AnnData and AnnSQL operations on a 4.4-million-cell single-nucleus RNA-seq dataset: AnnSQL-based operations executed in minutes on a laptop, whereas the equivalent AnnData operations largely failed (or were ∼700x slower) on a high-performance computing cluster. AnnSQL lowers the computational barriers to large-scale single-cell/nucleus RNA-seq analysis on a personal computer and demonstrates a promising computational infrastructure that could be extended to complete single-cell workflows across various genome-wide measurements.
AnnSQL is a pip-installable package available at https://github.com/ArpiarSaundersLab/annsql, with documentation at https://docs.annsql.com.
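To illustrate the general idea of querying an AnnData-like layout with SQL, the following is a minimal sketch rather than AnnSQL's actual API. It uses Python's built-in `sqlite3` module as a stand-in for DuckDB so that it runs without extra dependencies. The table names (`X` for expression values, `obs` for per-cell metadata), the column names, the helper `mean_expression_by_type`, and the toy values are all assumptions made for illustration.

```python
import sqlite3

# In-memory SQL database; sqlite3 stands in for the in-process DuckDB engine.
con = sqlite3.connect(":memory:")

# Assumed layout: one row per cell, one column per gene (hypothetical genes).
con.execute("CREATE TABLE X (cell_id TEXT PRIMARY KEY, gene_a REAL, gene_b REAL)")
# Assumed layout: per-cell annotations, keyed by the same cell identifier.
con.execute("CREATE TABLE obs (cell_id TEXT PRIMARY KEY, cell_type TEXT)")

# Toy expression values, invented for this example.
con.executemany(
    "INSERT INTO X VALUES (?, ?, ?)",
    [
        ("cell_1", 0.0, 5.0),
        ("cell_2", 2.0, 1.0),
        ("cell_3", 4.0, 0.0),
        ("cell_4", 6.0, 3.0),
    ],
)
con.executemany(
    "INSERT INTO obs VALUES (?, ?)",
    [
        ("cell_1", "neuron"),
        ("cell_2", "neuron"),
        ("cell_3", "glia"),
        ("cell_4", "glia"),
    ],
)


def mean_expression_by_type(connection):
    """Return (cell_type, mean gene_a, mean gene_b) rows, sorted by cell type."""
    query = """
        SELECT obs.cell_type, AVG(X.gene_a), AVG(X.gene_b)
        FROM X
        JOIN obs ON X.cell_id = obs.cell_id
        GROUP BY obs.cell_type
        ORDER BY obs.cell_type
    """
    return connection.execute(query).fetchall()


# Aggregate expression per annotated cell type.
result = mean_expression_by_type(con)

# Filter cells by an expression threshold, analogous to subsetting a matrix.
n_expressing = con.execute("SELECT COUNT(*) FROM X WHERE gene_a > 1").fetchone()[0]

print(result)
print(n_expressing)
```

The appeal of this pattern is that filtering, joining metadata, and aggregating are each a single declarative statement, and the database engine decides how to execute them instead of the whole matrix being loaded into memory first.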