Department of Computer Science, Emory University, 400 Dowman Drive, Atlanta, GA, 30322, USA.
Department of Biostatistics and Bioinformatics, Rollins School of Public Health, Emory University, 1518 Clifton Road NE, Atlanta, GA, 30322, USA.
Genome Biol. 2021 Sep 9;22(1):264. doi: 10.1186/s13059-021-02480-2.
Cell type identification is one of the most important questions in single-cell RNA sequencing (scRNA-seq) data analysis. With the accumulation of public scRNA-seq data, supervised cell type identification methods have gained increasing popularity due to better accuracy, robustness, and computational performance. Despite these advantages, the performance of supervised methods relies heavily on several key factors: feature selection, prediction method, and, most importantly, the choice of reference dataset.
In this work, we perform extensive real data analyses to systematically evaluate these strategies in supervised cell type identification. We first benchmark nine classifiers along with six feature selection strategies and investigate the impact of reference data size and number of cell types on cell type prediction. Next, we examine how discrepancies between reference and target datasets, and data preprocessing steps such as imputation and batch effect correction, affect prediction performance. We also investigate strategies for pooling and purifying reference data.
Based on our analysis results, we provide guidelines for using supervised cell typing methods. We suggest combining all individuals from available datasets to construct the reference dataset and using a multi-layer perceptron (MLP) as the classifier, along with the F-test as the feature selection method. All the code used for our analysis is available on GitHub ( https://github.com/marvinquiet/RefConstruction_supervisedCelltyping ).
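The recommended combination (pooled reference, F-test feature selection, MLP classifier) can be illustrated with a minimal sketch. This is not the code from the repository above: it assumes scikit-learn as the toolkit, uses simulated expression values in place of real scRNA-seq data, and the helper `simulate_individual`, the number of selected genes (`k=30`), and the hidden layer size are hypothetical choices made only for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)


def simulate_individual(n_cells, n_genes=200, n_types=3, shift=0.0):
    """Hypothetical toy data: each cell type has 10 up-regulated marker genes.

    `shift` adds a constant offset to mimic an individual-level difference.
    """
    labels = rng.integers(0, n_types, size=n_cells)
    X = rng.normal(0.0, 1.0, size=(n_cells, n_genes)) + shift
    for t in range(n_types):
        X[labels == t, t * 10:(t + 1) * 10] += 3.0
    return X, labels


# Pool cells from several reference individuals into one reference set.
reference = [simulate_individual(300, shift=s) for s in (0.0, 0.2, -0.2)]
X_ref = np.vstack([X for X, _ in reference])
y_ref = np.concatenate([y for _, y in reference])

# A separate individual plays the role of the target dataset.
X_target, y_target = simulate_individual(200, shift=0.1)

# F-test (one-way ANOVA F-statistic) feature selection followed by an MLP.
model = make_pipeline(
    SelectKBest(f_classif, k=30),   # k is an illustrative assumption
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0),
)
model.fit(X_ref, y_ref)

pred = model.predict(X_target)
accuracy = float((pred == y_target).mean())
n_selected = int(model.named_steps["selectkbest"].get_support().sum())
print(f"selected genes: {n_selected}, target accuracy: {accuracy:.3f}")
```

Putting the selection step inside the pipeline means genes are ranked on the reference data only, so the target labels never influence which features are kept.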