Dollinger Emmanuel, Silkwood Kai, Atwood Scott, Nie Qing, Lander Arthur D
Center for Complex Biological Systems, University of California, Irvine, Irvine, CA 92697.
Department of Developmental and Cell Biology, University of California, Irvine, Irvine, CA 92697.
bioRxiv. 2024 Oct 15:2024.10.11.617709. doi: 10.1101/2024.10.11.617709.
The high dimensionality of data in single-cell transcriptomics (scRNAseq) requires investigators to choose subsets of genes (feature selection) for downstream analysis (e.g., unsupervised cell clustering). Evaluating different approaches to feature selection is hampered by the fact that, as we show here, the performance of feature selection methods varies greatly with the task being performed. For routine cell type identification, even randomly chosen features can perform well, but when cell type differences are subtle, both the number of features and the selection strategy can matter strongly. Here we present a simple feature selection method grounded in an analytical model that, without resorting to arbitrary thresholds or user-defined parameters, allows for interpretable delineation of both how many and which features to choose, facilitating identification of biologically meaningful rare cell types. We compare this method to the default methods in scanpy and Seurat, as well as to SCTransform, showing how greater accuracy can often be achieved with surprisingly few, well-chosen features.
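The contrast the abstract draws between randomly chosen and deliberately selected features can be illustrated with a toy sketch. This is not the authors' method, nor the scanpy or Seurat defaults: it assumes synthetic Poisson counts for two cell types that differ in a small set of marker genes, and it ranks genes by Fano factor (variance divided by mean) as a generic stand-in for a selection strategy. All function names and parameter values (`simulate_counts`, `top_fano_genes`, `random_genes`, the fold change, the gene counts) are hypothetical and chosen only for illustration.

```python
import numpy as np


def simulate_counts(n_cells=300, n_genes=500, n_marker=20, fold=3.0, seed=0):
    """Toy data: two equally sized cell types, Poisson counts.

    The first `n_marker` genes are expressed `fold` times higher in type 1.
    Every value here is an illustrative assumption, not taken from the paper.
    """
    rng = np.random.default_rng(seed)
    labels = np.repeat([0, 1], n_cells // 2)
    base = rng.uniform(2.0, 10.0, size=n_genes)   # per-gene mean expression
    rates = np.tile(base, (n_cells, 1))
    rates[labels == 1, :n_marker] *= fold         # marker genes differ by type
    return rng.poisson(rates), labels


def fano_factor(counts):
    """Per-gene variance / mean; close to 1 for pure Poisson noise."""
    mean = counts.mean(axis=0)
    var = counts.var(axis=0, ddof=1)
    return np.divide(var, mean, out=np.zeros_like(var), where=mean > 0)


def top_fano_genes(counts, k):
    """Indices of the k genes with the largest Fano factor."""
    return np.argsort(fano_factor(counts))[::-1][:k]


def random_genes(n_genes, k, seed=0):
    """Indices of k genes drawn uniformly at random without replacement."""
    return np.random.default_rng(seed).choice(n_genes, size=k, replace=False)


counts, labels = simulate_counts()
markers = set(range(20))
selected = top_fano_genes(counts, 20)
baseline = random_genes(counts.shape[1], 20)

# How many true marker genes does each strategy recover?
print("Fano-ranked:", len(markers & set(selected.tolist())))
print("Random:     ", len(markers & set(baseline.tolist())))
```

In this toy setting only the marker genes carry excess variance, so a variance-based ranking finds them easily while a random draw mostly misses them. Real data are harder: technical noise, sequencing depth, and subtle differences between cell types all affect which genes rank highly, which is the regime where the abstract argues that the number of features and the selection strategy matter most.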