College of Life Sciences, Chongqing Normal University, Chongqing, 401331, People's Republic of China.
Chongqing Key Laboratory of Big Data for Bio Intelligence, Chongqing University of Posts and Telecommunications, Chongqing, 400065, People's Republic of China.
Genome Biol. 2024 Jun 3;25(1):145. doi: 10.1186/s13059-024-03290-y.
Single-cell RNA sequencing (scRNA-seq) and spatially resolved transcriptomics (SRT) have led to groundbreaking advances in the life sciences. Data simulation is widely used to develop bioinformatics tools for scRNA-seq and SRT data and to benchmark them without bias, because it provides explicit ground truth and can generate customized datasets. However, the performance of simulation methods across multiple scenarios has not been comprehensively assessed, so users lack practical guidelines for choosing a suitable method.
We systematically evaluated 49 simulation methods developed for scRNA-seq and/or SRT data in terms of accuracy, functionality, scalability, and usability, using 152 reference datasets derived from 24 platforms. SRTsim, scDesign3, ZINB-WaVE, and scDesign2 achieve the best accuracy across platforms. Unexpectedly, some methods tailored to scRNA-seq data are also potentially compatible with simulating SRT data. Lun, SPARSim, and scDesign3-tree outperform other methods in their respective simulation scenarios. Phenopath, Lun, Simple, and MFA yield high scalability scores, but they cannot generate realistic simulated data. Users should therefore weigh the trade-off between accuracy and scalability (or functionality) when choosing a method. Additionally, execution errors are mainly caused by failed parameter estimation and by missing or infinite values arising in calculations. We provide practical guidelines for method selection, a standard pipeline, Simpipe (https://github.com/duohongrui/simpipe; https://doi.org/10.5281/zenodo.11178409), and an online tool, Simsite (https://www.ciblab.net/software/simshiny/), for data simulation.
No method performs best on all criteria, so we recommend a method that is good but not necessarily the best, provided it solves the problem at hand effectively and reasonably. Our comprehensive work provides crucial insights for developers on modeling gene expression data and eases the simulation process for users.