Lunenfeld-Tanenbaum Research Institute, Toronto, Canada.
Program in Bioinformatics and Computational Biology, University of Toronto, Toronto, Canada.
Genome Biol. 2024 Jun 17;25(1):159. doi: 10.1186/s13059-024-03304-9.
The advent of single-cell RNA-sequencing (scRNA-seq) has driven extensive development of computational methods for every step of the scRNA-seq data analysis pipeline, including filtering, normalization, and clustering. The large number of methods and their parameter combinations has created a combinatorial set of possible pipelines for analyzing scRNA-seq data, which leads to the obvious question: which is best? Several benchmarking studies compare methods but frequently find that performance varies with dataset and pipeline characteristics. At the same time, the large number of available scRNA-seq datasets, along with advances in supervised machine learning, raises a tantalizing possibility: could the optimal pipeline be predicted for a given dataset?
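To make the combinatorial growth concrete, the following minimal sketch enumerates pipelines as the Cartesian product of one choice per step. The step names, option lists, and the `enumerate_pipelines` helper are illustrative assumptions and are not the steps or parameters used in the study.

```python
from itertools import product

# Hypothetical option lists for three pipeline steps (illustrative only;
# the study's actual steps and parameter values are not listed here).
FILTERS = ["lenient", "strict"]
NORMALIZATIONS = ["log", "scran", "sctransform"]
RESOLUTIONS = [0.3, 0.5, 0.8, 1.2]


def enumerate_pipelines(filters, normalizations, resolutions):
    """Return every combination of one choice per step as a dict."""
    return [
        {"filter": f, "normalization": n, "resolution": r}
        for f, n, r in product(filters, normalizations, resolutions)
    ]


pipelines = enumerate_pipelines(FILTERS, NORMALIZATIONS, RESOLUTIONS)
print(len(pipelines))  # 2 * 3 * 4 = 24 candidate pipelines
```

Each additional step or parameter multiplies the count, which is why exhaustive manual comparison quickly becomes impractical.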
Here, we begin to answer this question by applying 288 scRNA-seq analysis pipelines to 86 datasets and quantifying pipeline success via a range of measures evaluating cluster purity and biological plausibility. We build supervised machine learning models to predict pipeline success given a range of dataset and pipeline characteristics. We find that prediction performance is significantly better than random and that in many cases pipelines predicted to perform well provide clustering outputs similar to expert-annotated cell type labels. We also identify dataset characteristics that correlate with strong prediction performance, which could indicate when such prediction models are likely to be useful.
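The general idea of predicting pipeline success from dataset and pipeline characteristics can be sketched as follows. This is a toy illustration, not the authors' model: the abstract does not name the model type, so a simple nearest-neighbour regressor stands in for it, and the feature names, training rows, and scores are all made up for the example.

```python
import math


def knn_predict(train_X, train_y, x, k=1):
    """Average the scores of the k training rows closest to x (Euclidean)."""
    by_distance = sorted(
        (math.dist(row, x), score) for row, score in zip(train_X, train_y)
    )
    nearest = by_distance[:k]
    return sum(score for _, score in nearest) / len(nearest)


def recommend(train_X, train_y, dataset_features, candidates, k=1):
    """Return (predicted score, pipeline) for the best-scoring candidate."""
    scored = [
        (knn_predict(train_X, train_y, dataset_features + p, k), p)
        for p in candidates
    ]
    return max(scored, key=lambda pair: pair[0])


# Synthetic training rows (assumed features, not from the study):
# [scaled dataset size, sparsity, strict filter (0/1), clustering resolution]
train_X = [
    [0.2, 0.9, 0, 0.3],
    [0.2, 0.9, 1, 0.3],
    [0.2, 0.9, 1, 0.8],
    [0.8, 0.5, 0, 0.3],
    [0.8, 0.5, 0, 0.8],
    [0.8, 0.5, 1, 0.8],
]
# Synthetic "pipeline success" scores for each row above.
train_y = [0.40, 0.55, 0.70, 0.65, 0.80, 0.60]

# A new dataset and the candidate pipelines to rank for it.
new_dataset = [0.25, 0.85]
candidates = [[0, 0.3], [1, 0.3], [1, 0.8], [0, 0.8]]

score, pipeline = recommend(train_X, train_y, new_dataset, candidates)
print(score, pipeline)
```

The design point is that each training row concatenates dataset features with pipeline features, so a single model can score any dataset and pipeline pair and the recommendation is simply the candidate with the highest predicted score.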
Supervised machine learning models have utility for recommending analysis pipelines and therefore have the potential to alleviate the burden of choosing from the near-infinite number of possibilities. Different aspects of datasets influence the predictive performance of such models, which will further guide users.