Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, MA, USA.
Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, MA, USA and Department of Data Science, Dana-Farber Cancer Institute, Boston, MA, USA.
Biostatistics. 2022 Oct 14;23(4):1150-1164. doi: 10.1093/biostatistics/kxac021.
Single-cell RNA sequencing (scRNA-seq) quantifies gene expression for individual cells in a sample, which allows distinct cell-type populations to be identified and characterized. An important step in many scRNA-seq analysis pipelines is the annotation of cells into known cell types. While this can be achieved using experimental techniques, such as fluorescence-activated cell sorting, these approaches are impractical for large numbers of cells. This motivates the development of data-driven cell-type annotation methods. We find that current approaches are limited either by their reliance on known marker genes or by overfitting caused by systematic differences, or batch effects, between studies. Here, we present a statistical approach that leverages public data sets to combine information across thousands of genes, uses a latent variable model to define cell-type-specific barcodes and account for batch effect variation, and probabilistically annotates cell-type identity from a reference of known cell types. The barcoding approach also provides a new way to discover marker genes. Using a range of data sets, including those generated to represent imperfect real-world reference data, we demonstrate that our approach substantially outperforms current reference-based methods, particularly when predicting across studies.
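To make the idea of barcode-based probabilistic annotation concrete, the following is a toy sketch and not the model described in the paper. It assumes that expression has already been binarized to on/off per gene, that genes are independent given the cell type, and that all cell types are equally likely a priori. It does not model batch effects or use a latent variable model. The function names `build_barcodes` and `annotate`, the pseudocount, and the example data are all hypothetical.

```python
import math


def build_barcodes(reference, pseudocount=1.0):
    """Estimate, for each cell type, the probability that each gene is 'on'.

    `reference` maps a cell-type label to a list of binarized cells
    (each cell is a list of 0/1 values, one per gene). This is an
    illustrative stand-in for a barcode, not the paper's estimator.
    """
    barcodes = {}
    for cell_type, cells in reference.items():
        n_cells = len(cells)
        n_genes = len(cells[0])
        # Smoothed frequency keeps every probability strictly inside (0, 1).
        barcodes[cell_type] = [
            (sum(cell[g] for cell in cells) + pseudocount)
            / (n_cells + 2 * pseudocount)
            for g in range(n_genes)
        ]
    return barcodes


def annotate(cell, barcodes):
    """Return posterior probabilities over cell types for one binarized cell.

    Assumes a uniform prior over cell types and independent Bernoulli genes.
    """
    log_lik = {}
    for cell_type, probs in barcodes.items():
        log_lik[cell_type] = sum(
            math.log(p) if on else math.log(1.0 - p)
            for on, p in zip(cell, probs)
        )
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(log_lik.values())
    weights = {ct: math.exp(ll - top) for ct, ll in log_lik.items()}
    total = sum(weights.values())
    return {ct: w / total for ct, w in weights.items()}


# Hypothetical reference: three cells per type, four genes each.
reference = {
    "T cell": [[1, 1, 0, 0], [1, 1, 0, 1], [1, 0, 0, 0]],
    "B cell": [[0, 0, 1, 1], [0, 1, 1, 1], [0, 0, 1, 0]],
}
barcodes = build_barcodes(reference)
posterior = annotate([1, 1, 0, 0], barcodes)
best = max(posterior, key=posterior.get)
```

Returning a full posterior rather than a single label reflects the probabilistic annotation described above: a cell whose posterior is spread across several types can be flagged as uncertain instead of being forced into one class.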